CN110399222B - GPU cluster deep learning task parallelization method and device and electronic equipment - Google Patents

Info

Publication number
CN110399222B
CN110399222B (application CN201910675587.9A)
Authority
CN
China
Prior art keywords: gpu, deep learning, target, task, processed
Prior art date
Legal status
Active
Application number
CN201910675587.9A
Other languages
Chinese (zh)
Other versions
CN110399222A (en)
Inventor
张海涛
耿欣
马华东
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201910675587.9A
Publication of CN110399222A
Application granted
Publication of CN110399222B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Abstract

The GPU cluster deep learning task parallelization method, device, and electronic equipment provided by the embodiments of this application relate to the technical field of the Internet. First, the similarity between a deep learning task to be processed and each computing node of a GPU cluster is analyzed, and a target computing node for the task within the GPU cluster is determined; this reduces the possibility of resource contention on computing nodes and improves system resource utilization and the execution efficiency of deep learning tasks. The task is then divided into a number of target subtasks according to the number of GPUs it requires, the interference level and communication cost of the target subtasks are analyzed, and a target GPU for each subtask within the target computing node is determined. By accounting for interference and communication cost, unbalanced resource distribution across the GPUs of a computing node is avoided, high parallelization of deep learning tasks is achieved, and the resource utilization of the GPU cluster and the execution efficiency of deep learning tasks are both improved.

Description

GPU cluster deep learning task parallelization method and device and electronic equipment
Technical Field
The application relates to the technical field of internet, in particular to a GPU cluster deep learning task parallelization method and device and electronic equipment.
Background
With continued research, deep learning has achieved great success in fields such as computer vision, speech recognition, and text processing, bringing great convenience to people's lives. However, complex neural network models and huge volumes of data place higher demands on computing power. A GPU (Graphics Processing Unit) cluster integrates many GPU computing resources, provides powerful and efficient parallel computing capability for computation-intensive deep learning tasks, and effectively meets the computing requirements of multiple deep learning tasks.
However, when deep learning tasks run on a resource-shared GPU cloud platform, their execution efficiency suffers from interference caused by resource competition with other co-executing tasks. Therefore, for deep learning tasks in a GPU cluster, scheduling tasks in parallel according to their characteristics and resource requirements, and making reasonable use of the cluster's nodes and the multiple GPU resources on each node, is of great importance for optimizing the execution time of deep learning tasks, improving the processing performance of the overall computing workload, and improving the system's resource utilization.
Mainstream GPU clusters currently rely on traditional schedulers (for example, Kubernetes or YARN) to schedule deep learning tasks. By tracking overall resource usage, such a scheduler allocates GPU resources to tasks and ensures that each task has sufficient resources throughout its life cycle.
Although this approach parallelizes deep learning tasks to some extent, it mainly considers resource usage and ignores both the physical characteristics of the resources and the characteristics of the tasks; it therefore cannot achieve efficient parallelization of deep learning tasks and can reduce the execution efficiency of deep learning workloads. Moreover, it does not support fine-grained multitask allocation on a GPU, so GPU resources on a node cannot be fully utilized, which hinders the efficient execution of deep learning tasks and lowers per-node GPU utilization, in turn affecting the resource utilization of the whole GPU cluster.
Disclosure of Invention
The embodiment of the application aims to provide a parallelization method and device for a deep learning task of a GPU cluster, electronic equipment, a storage medium and a computer program product containing instructions, so that high parallelization of the deep learning task is realized, and the utilization rate of GPU resources in the GPU cluster and the execution efficiency of the deep learning task are improved.
The specific technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a GPU cluster deep learning task parallelization method, including:
acquiring task set information of deep learning tasks to be processed and a GPU cluster, wherein the task set information of the GPU cluster comprises task information of each task in a waiting queue of each computing node of the GPU cluster and task information of each subtask being executed by each GPU in each computing node of the GPU cluster;
analyzing task information of the deep learning task to be processed and task information of each task in a waiting queue of each computing node of the GPU cluster to respectively obtain the similarity of the deep learning task to be processed and each computing node of the GPU cluster;
determining target computing nodes of the deep learning task to be processed in the GPU cluster according to the similarity;
dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed;
analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to respectively obtain the interference level and the communication cost of each target subtask;
and respectively determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level and the communication cost of each target subtask.
Optionally, the task information of the deep learning task to be processed includes: the maximum CPU utilization rate of the deep learning task to be processed, the host memory utilization rate of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization rate of the deep learning task to be processed, the device memory utilization rate of the deep learning task to be processed, the bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of a data set of the deep learning task to be processed.
Optionally, the analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each computing node of the GPU cluster to obtain the similarity between the deep learning task to be processed and each computing node of the GPU cluster respectively includes:
inputting the deep learning task information to be processed and the task information of each task in each computing node waiting queue of the GPU cluster into a preset similarity prediction model to respectively obtain the feature vector of the deep learning task to be processed and the feature vector of each task in each computing node waiting queue of the GPU cluster;
respectively calculating the similarity between the deep learning task to be processed and each task in each calculation node waiting queue of the GPU cluster according to the following formula:
$$\cos(\theta_{J_i, S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}$$

where $J_i$ denotes the feature vector of the i-th deep learning task to be processed; $S_j$ denotes the feature vectors of the tasks in the waiting queue of the j-th compute node in the GPU cluster, with $S_j$ comprising n tasks; $S_{jk}$ denotes the feature vector of the k-th task in the waiting queue of the j-th compute node, $k \in \{1, \dots, n\}$; $\theta_{J_i, S_{jk}}$ is the angle between $J_i$ and $S_{jk}$; $\cos(\theta_{J_i, S_{jk}})$ represents the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node in the GPU cluster; and $\lVert J_i \rVert$ and $\lVert S_{jk} \rVert$ are the moduli of the vectors $J_i$ and $S_{jk}$;
and obtaining the similarity between the deep learning task to be processed and each computing node of the GPU cluster according to the similarity between the deep learning task to be processed and each task in each computing node waiting queue of the GPU cluster.
Optionally, the analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to obtain the interference level and the communication cost of each target subtask respectively includes:
mapping the plurality of target subtasks to each GPU in the target computing node respectively to obtain the mapping relation between the target subtasks and each GPU;
respectively inputting the target subtask and each subtask being executed by each GPU into a preset performance prediction model, and respectively obtaining the performance degradation when the target subtask and each subtask being executed by each GPU are executed together;
respectively calculating the interference level of the target subtask according to each performance degradation;
and respectively calculating the communication cost of the target subtask according to the available bandwidth among the GPUs and the data volume required by the updated model.
Optionally, the determining, in the GPU of the target computing node, the execution GPU of each target subtask according to the interference level and the communication cost of each target subtask includes:
determining an objective function for the target GPU, where the objective function is:

$$\min F(M) = \alpha\, I(M) + \beta\, C(M)$$

where $F(M)$ is the objective function value, I(M) represents the interference level when the mapping relationship is M, C(M) represents the communication cost when the mapping relationship is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;

when $F(M)$ reaches its minimum, the GPUs given by the corresponding mapping relationship M are the target GPUs.
Optionally, the determining, in the GPU of the target computing node, the execution GPU of each target subtask according to the interference level and the communication cost of each target subtask includes:
and determining a target GPU of the target subtask in the target computing node according to a preset optimization algorithm, the interference level and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
In a second aspect, an embodiment of the present application provides a GPU cluster deep learning task parallelization apparatus, including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring task set information of deep learning tasks to be processed and GPU clusters, and the task set information of the GPU clusters comprises task information of each task in a waiting queue of each computing node of the GPU clusters and task information of each subtask being executed by each GPU in each computing node of the GPU clusters;
the first analysis module is used for analyzing the task information of the deep learning task to be processed and the task information of each task in each computing node waiting queue of the GPU cluster to respectively obtain the similarity between the deep learning task to be processed and each computing node of the GPU cluster;
the calculation node determination module is used for determining a target calculation node of the deep learning task to be processed in the GPU cluster according to the similarity;
the subtask module is used for dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed;
the second analysis module is used for analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to respectively obtain the interference level and the communication cost of each target subtask;
and the GPU determining module is used for determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level and the communication cost of each target subtask.
Optionally, the task information of the deep learning task to be processed includes: the maximum CPU utilization rate of the deep learning task to be processed, the host memory utilization rate of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization rate of the deep learning task to be processed, the device memory utilization rate of the deep learning task to be processed, the bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
Optionally, the first analysis module is specifically configured to:
inputting the deep learning task information to be processed and the task information in the waiting queue of each computing node of the GPU cluster into a preset similarity prediction model to respectively obtain a feature vector of the deep learning task to be processed and a feature vector of a task in the waiting queue of each computing node of the GPU cluster;
respectively calculating the similarity between the deep learning task to be processed and each task in each calculation node waiting queue of the GPU cluster according to the following formula:
$$\cos(\theta_{J_i, S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}$$

where $J_i$ denotes the feature vector of the i-th deep learning task to be processed; $S_j$ denotes the feature vectors of the tasks in the waiting queue of the j-th compute node in the GPU cluster, with $S_j$ comprising n tasks; $S_{jk}$ denotes the feature vector of the k-th task in the waiting queue of the j-th compute node, $k \in \{1, \dots, n\}$; $\theta_{J_i, S_{jk}}$ is the angle between $J_i$ and $S_{jk}$; $\cos(\theta_{J_i, S_{jk}})$ represents the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node in the GPU cluster; and $\lVert J_i \rVert$ and $\lVert S_{jk} \rVert$ are the moduli of the vectors $J_i$ and $S_{jk}$;
and obtaining the similarity between the deep learning task to be processed and each computing node of the GPU cluster according to the similarity between the deep learning task to be processed and each task in each computing node waiting queue of the GPU cluster.

Optionally, the second analysis module is specifically configured to:
mapping the plurality of target subtasks to each GPU in the target computing node respectively to obtain the mapping relation between the target subtasks and each GPU;
respectively inputting the target subtask and each subtask being executed by each GPU into a preset performance prediction model, and respectively obtaining the performance degradation when the target subtask and each subtask being executed by each GPU are executed together;
respectively calculating the interference level of the target subtask according to each performance degradation;
and respectively calculating the communication cost of the target subtask according to the available bandwidth among the GPUs and the data volume required by the updated model.
Optionally, the second analysis module is specifically configured to:
determining an objective function for the target GPU, where the objective function is:

$$\min F(M) = \alpha\, I(M) + \beta\, C(M)$$

where $F(M)$ is the objective function value, I(M) represents the interference level when the mapping relationship is M, C(M) represents the communication cost when the mapping relationship is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;

when $F(M)$ reaches its minimum, the GPUs given by the corresponding mapping relationship M are the target GPUs.
Optionally, the second analysis module is specifically configured to:
and determining a target GPU of the target subtask in the target computing node according to a preset optimization algorithm, the interference level and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a communication interface, a memory, and a communication bus, wherein:
the processor, the communication interface and the memory complete mutual communication through a communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the GPU cluster deep learning task parallelization method according to any one of the first aspect, when executing a program stored in a memory.
In a fourth aspect, an embodiment of the present application provides a storage medium, where instructions are stored in the storage medium, and when the instructions are executed on a computer, the instructions cause the computer to execute the GPU cluster deep learning task parallelization method according to any one of the first aspects.
In a fifth aspect, an embodiment of the present application provides a computer program product containing instructions, which when executed on a computer, causes the computer to execute the GPU cluster deep learning task parallelization method according to any one of the first aspect.
The GPU cluster deep learning task parallelization method, device, electronic equipment, storage medium, and instruction-containing computer program product provided by the embodiments of this application determine a target computing node for a deep learning task to be processed within a GPU cluster by analyzing the similarity between the task and each computing node of the cluster. Fully accounting for the similarity between the pending task and other tasks reduces the possibility of resource contention on computing nodes, thereby improving system resource utilization and the execution efficiency of deep learning tasks. The pending task is divided into a number of target subtasks according to the number of GPUs it requires, and the interference level and communication cost of the target subtasks are analyzed to determine a target GPU for each subtask within the target computing node. Considering the interference level and communication cost of the target subtasks avoids unbalanced resource distribution across the GPUs of a computing node, achieves high parallelization of deep learning tasks, improves the resource utilization of the GPU cluster, and at the same time improves the execution efficiency of deep learning tasks. Of course, not all of the above advantages need be achieved in the practice of any one product or method of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic diagram of a GPU cluster deep learning task parallelization method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a GPU cluster deep learning task parallelization apparatus according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application discloses a GPU cluster deep learning task parallelization method and device, electronic equipment, a storage medium and a computer program product containing instructions, which are respectively described below.
The embodiment of the application provides a parallelization method for a deep learning task of a GPU cluster, and referring to fig. 1, fig. 1 is a schematic diagram of the parallelization method for the deep learning task of the GPU cluster, and the parallelization method comprises the following steps:
and step 110, acquiring task set information of the deep learning task to be processed and the GPU cluster, wherein the task set information of the GPU cluster comprises task information of each task in a waiting queue of each computing node of the GPU cluster and task information of each subtask being executed by each GPU in each computing node of the GPU cluster.
The GPU cluster deep learning task parallelization method can be realized through electronic equipment, and specifically, the electronic equipment can be a server.
To improve GPU computing performance, a GPU cluster can be scaled out horizontally, i.e. the cluster is formed from multiple GPUs across multiple nodes, integrating many GPUs to complete complex computing tasks. When the GPU cluster is used for large-scale deep learning training, there may be many deep learning tasks in the cluster, and the deep learning task to be processed is any one of them: if the system has p deep learning tasks to be processed, the i-th pending task is denoted $J_i$, $i \in \{1, \dots, p\}$. The GPU cluster maintains a task set comprising the tasks in the waiting queue of each computing node of the GPU cluster and the subtasks being executed by each GPU in each computing node.
And step 120, analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each computing node of the GPU cluster to respectively obtain the similarity between the deep learning task to be processed and each computing node of the GPU cluster.
In order to improve the execution efficiency of the GPU cluster and the utilization rate of the computing resources, the similarity between the deep learning task to be processed and each computing node of the GPU cluster needs to be calculated, so as to avoid interference between the computing resources.
In a possible implementation manner, the task information of the deep learning task to be processed includes: the maximum CPU utilization rate of the deep learning task to be processed, the host memory utilization rate of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization rate of the deep learning task to be processed, the device memory utilization rate of the deep learning task to be processed, the bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of a data set of the deep learning task to be processed.
In the embodiment of the present application, the deep learning task to be processed is characterized as $J_i = (jU_{CPU}, jU_{hMem}, jThP_{I/O}, jU_{GPU}, jU_{dMem}, jThP_{PCIe}, batch\_size, dsize)$, where $jU_{CPU}$ represents the maximum CPU usage of the pending deep learning task, $jU_{hMem}$ represents its host memory usage, $jThP_{I/O}$ its I/O throughput, $jU_{GPU}$ its GPU usage, $jU_{dMem}$ its device memory usage, $jThP_{PCIe}$ its bandwidth usage, batch_size the number of samples analyzed in each training step, and dsize the size of its data set.
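For illustration only (this sketch is not part of the patent text), the characterization above maps naturally onto a small record type; the class and field names below are assumptions chosen to mirror the notation just defined:

```python
from dataclasses import dataclass

@dataclass
class TaskInfo:
    """Feature vector J_i of a pending deep learning task (hypothetical names)."""
    cpu_util: float        # jU_CPU: maximum CPU usage
    host_mem_util: float   # jU_hMem: host memory usage
    io_throughput: float   # jThP_I/O: I/O throughput
    gpu_util: float        # jU_GPU: GPU usage
    dev_mem_util: float    # jU_dMem: device memory usage
    pcie_bw_util: float    # jThP_PCIe: PCIe bandwidth usage
    batch_size: int        # samples analyzed per training step
    dsize: int             # size of the training data set
```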
In a possible implementation manner, the analyzing task information of the deep learning task to be processed and task information of each task in a waiting queue of each compute node of the GPU cluster to obtain similarity between the deep learning task to be processed and each compute node of the GPU cluster respectively includes:
inputting the deep learning task information to be processed and the task information in the waiting queue of each computing node of the GPU cluster into a preset similarity prediction model to respectively obtain a feature vector of the deep learning task to be processed and a feature vector of a task in the waiting queue of each computing node of the GPU cluster;
and respectively calculating the similarity between the deep learning task to be processed and each task in each calculation node waiting queue of the GPU cluster according to the following formula:
$$\cos(\theta_{J_i, S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}$$

where $J_i$ denotes the feature vector of the i-th deep learning task to be processed; $S_j$ denotes the feature vectors of the tasks in the waiting queue of the j-th compute node in the GPU cluster, with $S_j$ comprising n tasks; $S_{jk}$ denotes the feature vector of the k-th task in the waiting queue of the j-th compute node, $k \in \{1, \dots, n\}$; $\theta_{J_i, S_{jk}}$ is the angle between $J_i$ and $S_{jk}$; $\cos(\theta_{J_i, S_{jk}})$ represents the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node in the GPU cluster; and $\lVert J_i \rVert$ and $\lVert S_{jk} \rVert$ are the moduli of the vectors $J_i$ and $S_{jk}$;
and obtaining the similarity between the deep learning task to be processed and each computing node of the GPU cluster according to the similarity between the deep learning task to be processed and each task in each computing node waiting queue of the GPU cluster.
The preset Similarity Prediction model may be an IASP (Interference-Aware Similarity Prediction) model.
For example, a to-be-processed deep learning task metric matrix $M_J$ is defined to characterize the deep learning tasks, where each row of $M_J$ represents one deep learning task to be processed and each column represents one performance feature of that task, i.e. the features consist of CPU resource utilization and GPU resource utilization metrics.

The task information of the pending deep learning task $J_i$ is obtained, standardized, and filled into the corresponding row of $M_J$. Each pending task $J_i$ in the queue is profiled with a small volume of data, and two features are randomly selected from the task's features to obtain their values. The CPU resource utilization is analyzed using the virtual file system (the /proc file system) to obtain CPU metrics, and the GPU resource utilization is analyzed using a profiling tool such as the NVIDIA Profiler to obtain GPU metrics; the measured performance metrics are filled into the task characterization vector $J_i$, and $J_i$ is inserted into the matrix $M_J$.

The IASP model then predicts the missing entries of $M_J$ and fills them in to obtain a complete matrix, and the feature vector of the pending deep learning task is obtained from the IASP model. Similarly, the task information of the tasks in the waiting queue of each computing node of the GPU cluster is input into the IASP model to obtain the feature vectors of those tasks.
Specifically, the IASP model is a deep collaborative filtering model for similarity prediction, such as a DCF (Deep Collaborative Filtering) model. Further, the DCF model can be optimized using stochastic gradient descent (SGD).
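The patent does not disclose the internals of the IASP/DCF model. As a rough, hypothetical illustration of the collaborative-filtering idea (completing the partially observed task-metric matrix $M_J$ with SGD), a minimal low-rank factorization sketch might look as follows; the function name, rank, and hyperparameters are all assumptions:

```python
import numpy as np

def complete_task_matrix(M, observed, rank=4, lr=0.01, reg=0.1, epochs=200):
    """Fill the missing entries of a task-metric matrix M by low-rank
    factorization M ~ U @ V.T, trained with SGD on the observed entries.
    `observed` is a boolean mask that is True where M holds a measurement."""
    n_tasks, n_feats = M.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_tasks, rank))
    V = rng.normal(scale=0.1, size=(n_feats, rank))
    rows, cols = np.nonzero(observed)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = M[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            U[i] += lr * (err * V[j] - reg * U[i])  # SGD step on task factors
            V[j] += lr * (err * u_old - reg * V[j])  # SGD step on feature factors
    return U @ V.T  # completed matrix; row i serves as the feature vector of task i
```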
And respectively calculating the similarity between the deep learning task to be processed and each task in each calculation node waiting queue of the GPU cluster according to the following formula:
$$\cos(\theta_{J_i, S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}$$

where $J_i$ denotes the feature vector of the i-th deep learning task to be processed; $S_j$ denotes the feature vectors of the tasks in the waiting queue of the j-th compute node in the GPU cluster, with $S_j$ comprising n tasks; $S_{jk}$ denotes the feature vector of the k-th task in the waiting queue of the j-th compute node, $k \in \{1, \dots, n\}$; $\theta_{J_i, S_{jk}}$ is the angle between $J_i$ and $S_{jk}$; $\cos(\theta_{J_i, S_{jk}})$ represents the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node in the GPU cluster; and $\lVert J_i \rVert$ and $\lVert S_{jk} \rVert$ are the moduli of the vectors $J_i$ and $S_{jk}$.

The similarity between the deep learning task to be processed and each computing node of the GPU cluster is then obtained from the similarities between the task and each task in that node's waiting queue. For example, the similarity between $J_i$ and $S_j$ is the average of the similarities between $J_i$ and each task in $S_j$: after computing the similarity between the i-th pending deep learning task and each of the tasks in the waiting queue of the j-th compute node, the average of these similarities gives the similarity between the i-th pending task and the j-th compute node.

Because $S_j$ comprises n tasks, the similarity between the i-th pending deep learning task and every task in the waiting queue of the j-th compute node must be computed, and the average of all these similarities taken, to obtain the similarity between the i-th pending task and the j-th compute node of the GPU cluster.
And step 130, determining target computing nodes of the deep learning task to be processed in the GPU cluster according to the similarity.
According to the similarities, the computing node with the minimum similarity in the GPU cluster is determined as the target computing node of the deep learning task to be processed in the GPU cluster.

Selecting the computing node with the minimum similarity avoids contention for computing resources, improving the execution efficiency of the GPU cluster and the utilization of its computing resources.
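Putting steps 120 and 130 together, the node-selection logic can be sketched as below. This is a minimal illustration assuming the feature vectors have already been produced by the similarity prediction model, every waiting queue is non-empty, and all feature vectors are nonzero; the function names are not from the patent:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) between two task feature vectors (assumed nonzero)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def node_similarity(job_vec: np.ndarray, queue_vecs: list) -> float:
    """Similarity between a pending task J_i and one compute node:
    the average cosine similarity over the n tasks in its waiting queue."""
    return sum(cosine_sim(job_vec, s) for s in queue_vecs) / len(queue_vecs)

def pick_target_node(job_vec: np.ndarray, queues: dict) -> int:
    """Return the node index with minimum similarity, i.e. the node whose
    queued tasks are least likely to contend for the same resources.
    `queues` maps node index -> list of queued-task feature vectors."""
    return min(queues, key=lambda j: node_similarity(job_vec, queues[j]))
```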
And 140, dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed.
After the deep learning task to be processed is distributed to the target computing node in the GPU cluster, the task is divided into a number of target subtasks according to the number of GPUs it requires. For a pending deep learning task $J_i$, the set of target subtasks is defined as $\{t_i^1, \dots, t_i^n\}$, where n denotes the number of subtasks into which $J_i$ is divided and $t_i^j$, $j \in \{1, \dots, n\}$, denotes the j-th target subtask.
Step 150, analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node, and obtaining the interference level and the communication cost of each target subtask respectively.
The GPU set in the target computing node is defined as $\{G_k\}$, $k \in \{1, \dots, m\}$, where m denotes the number of GPUs in the target computing node and $G_k$ denotes the k-th GPU.

A target subtask is characterized as $t_i^j = (tE_{SM}, tU_{L1}, tThP_{L1}, tU_{L2}, tThP_{L2}, tU_{Tex}, tThP_{Tex}, tU_{DRAM}, tThP_{DRAM}, tThP_{L}, tThP_{S}, sub\_dsize)$, where $tE_{SM}$, $tU_{L1}$, $tThP_{L1}$, $tU_{L2}$, $tThP_{L2}$, $tU_{Tex}$, $tThP_{Tex}$, $tU_{DRAM}$, $tThP_{DRAM}$, $tThP_{L}$, and $tThP_{S}$ respectively represent the subtask's SM efficiency, GPU L1 cache usage, L1 read throughput, GPU L2 cache usage, L2 read throughput, GPU texture cache usage, texture cache read throughput, GPU memory usage, memory read throughput, global load throughput, and global store throughput, and sub_dsize represents the size of the target subtask's data set.
In one possible implementation, the analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to obtain the interference level and the communication cost of each target subtask respectively includes:
mapping the plurality of target subtasks to each GPU in the target computing node respectively to obtain the mapping relation between the target subtasks and each GPU;
respectively inputting the target subtask and each subtask being executed by each GPU into a preset performance prediction model, and respectively obtaining the performance degradation when the target subtask and each subtask being executed by each GPU are executed together;
respectively calculating the interference level of the target subtask according to each performance degradation;
and respectively calculating the communication cost of the target subtask according to the available bandwidth among the GPUs and the data volume required by the updated model.
Performance degradation is defined as the ratio of a subtask's completion time when running together with other subtasks to its completion time when running alone.
The target subtasks and the subtasks being executed by the GPUs are respectively input into a preset Performance Prediction model, where the preset Performance Prediction model may be an IAPP (Interference-Aware Performance Prediction) model, and specifically, the preset Performance Prediction model may be a DNN (Deep Neural Network) model.
For example, define the target subtasks as $t_i^j$, $j \in \{1, \dots, n\}$, where n denotes that the deep learning task to be processed is divided into n target subtasks in total and $t_i^j$ denotes the j-th target subtask, and define the GPU set in the target computing node as $G = \{G_1, \dots, G_m\}$, indicating that there are m GPUs in the target computing node. The target subtasks $\{t_i^1, \dots, t_i^n\}$ are mapped to the GPUs of the target computing node to obtain the mapping relation between the target subtasks and the GPUs, defined as $M(j) = k$, $j \in \{1, \dots, n\}$, $k \in \{1, \dots, m\}$, meaning that the j-th target subtask $t_i^j$ is assigned to the k-th GPU, $G_k$.
Specifically, a subtask metric matrix $M_t$ is defined to characterize the subtasks, where a subtask is either a target subtask or a subtask being executed by one of the GPUs.

The subtasks being executed by each GPU form the row vectors of the matrix $M_t$ that is input to the IAPP model; the model's output is a vector $z_t$, each element of which represents the performance degradation of the target subtask when executed together with one of the other subtasks.

The average performance degradation of the target subtask $t_i^j$ when allocated to one GPU is calculated using the following formula:

$$slowdown_{jk} = \frac{1}{num_k} \sum_{l=1}^{num_k} z_t[l]$$

where $slowdown_{jk}$ represents the average performance degradation of target subtask $t_i^j$ when allocated to the k-th GPU in the target computing node, and $num_k$ represents the number of commonly executed subtasks on the k-th GPU in the target computing node.
The interference level is defined as

$$I(M) = \sum_{j=1}^{n} slowdown_{j, M(j)}$$

where $slowdown_{j, M(j)}$ is the average performance degradation of the j-th target subtask $t_i^j$ under the mapping relation M(j), and n denotes that the deep learning task to be processed is divided into n target subtasks in total.

The communication cost from $G_i$ to $G_j$ is defined as $cc_{ij}$, $i \in \{1, \dots, m\}$, $j \in \{1, \dots, m\}$, where $G_i$ represents the i-th GPU in the target computing node and $G_j$ represents the j-th GPU in the target computing node. The communication cost may be calculated from the available bandwidth between the physical GPUs, where high available bandwidth means low communication cost and low available bandwidth means high communication cost.

The communication requirement between the i-th and j-th target subtasks is defined as $cr_{ij}$, which is equal to the amount of data required to update the model. The total communication cost between the target subtasks under a mapping M is obtained according to the following formula:

$$C(M) = \sum_{i=1}^{n} \sum_{j=1}^{n} cr_{ij} \cdot cc_{M(i), M(j)}$$

where C(M) represents the communication cost when the mapping relation is M.
And step 160, determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level of each target subtask and the communication cost.
By considering the interference level and the communication cost of the target subtask, unbalanced resource distribution on the GPU in the computing node is avoided, high parallelization of the deep learning task is realized, the resource utilization rate of the GPU cluster is improved, and the execution efficiency of the deep learning task is improved.
In one possible implementation, the determining, in the GPU of the target computing node, an execution GPU of each target subtask according to the interference level of each target subtask and the communication cost includes:
determining an objective function for the target GPU, where the objective function is:

$$\min F(M) = \alpha\, I(M) + \beta\, C(M)$$

where $F(M)$ is the objective function value, I(M) represents the interference level when the mapping relationship is M, C(M) represents the communication cost when the mapping relationship is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;

when $F(M)$ reaches its minimum, the GPUs given by the corresponding mapping relationship M are the target GPUs.
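Combining the two terms, F(M) can be evaluated per candidate mapping, and for small n and m the minimizer can even be found exhaustively. A sketch assuming the helper functions above and α + β = 1:

```python
from itertools import product

def objective(mapping, slowdown, cr, cc, alpha=0.5, beta=0.5):
    """F(M) = alpha * I(M) + beta * C(M), with alpha + beta = 1."""
    return (alpha * interference_level(mapping, slowdown)
            + beta * communication_cost(mapping, cr, cc))

def best_mapping_exhaustive(n_subtasks, n_gpus, slowdown, cr, cc):
    """Enumerate all n_gpus ** n_subtasks mappings and keep the one that
    minimizes F(M). Only feasible for small instances; the patent also
    allows heuristics such as particle swarm optimization (below)."""
    return min(product(range(n_gpus), repeat=n_subtasks),
               key=lambda M: objective(M, slowdown, cr, cc))
```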
By considering the interference level and the communication cost of the target subtask, unbalanced resource distribution on the GPU in the computing node is avoided, high parallelization of the deep learning task is realized, the resource utilization rate of the GPU cluster is improved, and the execution efficiency of the deep learning task is improved.
In one possible implementation, the determining, in the GPU of the target computing node, an execution GPU of each target subtask according to the interference level of each target subtask and the communication cost includes:
and determining a target GPU of the target subtask in the target computing node according to a preset optimization algorithm, the interference level and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
For example, the particle swarm optimization algorithm initializes a population of particles representing the target subtasks to be assigned. Particle positions and velocities are randomly initialized over the GPU set $G = \{G_1, \dots, G_m\}$; the value assigned to each dimension of a particle is a GPU index (e.g. k for the k-th GPU), so a particle represents a mapping of multiple subtasks to the GPUs.

Each particle is described by its current position $x_i(t)$ and current velocity $v_i(t)$, and each particle knows both the best position pbest it has found so far and the global best position gbest found by the entire population. The principle of the algorithm is to move the particles to search for the optimal solution: each particle's position is influenced by its own best position pbest and by the global best position gbest, and in each generation a particle updates its best position pbest using the fitness value computed from the fitness function. During each iteration, each particle updates its velocity and position using the following formulas:

$$v_i(t+1) = \omega\, v_i(t) + c_1 r_1 \big(pbest_i - x_i(t)\big) + c_2 r_2 \big(gbest - x_i(t)\big)$$

$$x_i(t+1) = x_i(t) + v_i(t+1)$$

where ω is the inertia weight of the particle, $c_1$ and $c_2$ are acceleration coefficients, and $r_1$ and $r_2$ are random numbers between 0 and 1.

The objective function $F(M)$ is used to evaluate each particle, after which the velocity and position of each particle are updated according to the two formulas above. The iteration continues until the specified number of iterations is reached or the required precision is met, and the optimal GPU mapping is found.
And determining a target GPU of the target subtask in the target computing node according to a preset optimization algorithm, the interference level and the communication cost, so that the resource utilization rate of a GPU cluster can be improved to the maximum extent, and the execution efficiency of a deep learning task can be improved to the maximum extent.
In summary, the method first analyzes the similarity between the deep learning task to be processed and each computing node of the GPU cluster and determines a target computing node for the task within the cluster. Fully considering the similarity between the pending task and other tasks reduces the possibility of resource contention on computing nodes, thereby improving system resource utilization and the execution efficiency of deep learning tasks. The pending task is then divided into a number of target subtasks according to the number of GPUs it requires, and the interference level and communication cost of the target subtasks are analyzed to determine a target GPU for each subtask within the target computing node. Considering the interference level and communication cost of the target subtasks avoids unbalanced resource distribution across the GPUs of a computing node, achieves high parallelization of deep learning tasks, improves the resource utilization of the GPU cluster, and at the same time improves the execution efficiency of deep learning tasks.
An embodiment of the present application further provides a device, referring to fig. 2, where fig. 2 is a schematic diagram of a GPU cluster deep learning task parallelization device according to an embodiment of the present application, where the device includes:
the acquisition module 210 is configured to acquire a deep learning task to be processed and task set information of a GPU cluster, where the task set information of the GPU cluster includes task information of each task in a waiting queue of each compute node of the GPU cluster and task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
a first analysis module 220, configured to analyze task information of the deep learning task to be processed and task information of each task in a waiting queue of each computing node of the GPU cluster, and obtain similarities between the deep learning task to be processed and each computing node of the GPU cluster, respectively;
a computing node determining module 230, configured to determine, according to the similarity, a target computing node of the to-be-processed deep learning task in the GPU cluster;
the subtask module 240 is configured to divide the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed;
a second analysis module 250, configured to analyze task information of the target subtask and task information of each subtask being executed by each GPU in the target computing node, and obtain an interference level and a communication cost of each target subtask, respectively;
and a GPU determining module 260, configured to determine, in the GPUs of the target computing nodes, execution GPUs of the target subtasks respectively according to the interference levels of the target subtasks and the communication costs.
In a possible implementation manner, the task information of the deep learning task to be processed includes: the maximum CPU utilization rate of the deep learning task to be processed, the host memory utilization rate of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization rate of the deep learning task to be processed, the device memory utilization rate of the deep learning task to be processed, the bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of a data set of the deep learning task to be processed.
In a possible implementation manner, the first analysis module 220 is specifically configured to:
inputting the deep learning task information to be processed and the task information in the waiting queue of each computing node of the GPU cluster into a preset similarity prediction model to respectively obtain a feature vector of the deep learning task to be processed and a feature vector of a task in the waiting queue of each computing node of the GPU cluster;
and respectively calculating the similarity between the deep learning task to be processed and each task in each calculation node waiting queue of the GPU cluster according to the following formula:
$$\cos(\theta_{J_i, S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}$$

where $J_i$ denotes the feature vector of the i-th deep learning task to be processed; $S_j$ denotes the feature vectors of the tasks in the waiting queue of the j-th compute node in the GPU cluster, with $S_j$ comprising n tasks; $S_{jk}$ denotes the feature vector of the k-th task in the waiting queue of the j-th compute node, $k \in \{1, \dots, n\}$; $\theta_{J_i, S_{jk}}$ is the angle between $J_i$ and $S_{jk}$; $\cos(\theta_{J_i, S_{jk}})$ represents the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node in the GPU cluster; and $\lVert J_i \rVert$ and $\lVert S_{jk} \rVert$ are the moduli of the vectors $J_i$ and $S_{jk}$;

and obtaining the similarity between the deep learning task to be processed and each computing node of the GPU cluster according to the similarity between the deep learning task to be processed and each task in each computing node waiting queue of the GPU cluster.

In a possible implementation manner, the second analysis module 250 is specifically configured to:
mapping the plurality of target subtasks to each GPU in the target computing node respectively to obtain the mapping relation between the target subtasks and each GPU;
respectively inputting the target subtask and each subtask being executed by each GPU into a preset performance prediction model, and respectively obtaining the performance degradation when the target subtask and each subtask being executed by each GPU are executed together;
respectively calculating the interference level of the target subtask according to each performance degradation;
and respectively calculating the communication cost of the target subtask according to the available bandwidth among the GPUs and the data volume required by the updated model.
In a possible implementation manner, the second analysis module 250 is specifically configured to:
determining an objective function for the target GPU, where the objective function is:

$$\min F(M) = \alpha\, I(M) + \beta\, C(M)$$

where $F(M)$ is the objective function value, I(M) represents the interference level when the mapping relationship is M, C(M) represents the communication cost when the mapping relationship is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;

when $F(M)$ reaches its minimum, the GPUs given by the corresponding mapping relationship M are the target GPUs.
In a possible implementation manner, the second analysis module 250 is specifically configured to:
and determining a target GPU of the target subtask in the target computing node according to a preset optimization algorithm, the interference level and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
An embodiment of the present application further provides an electronic device, see fig. 3, including: a processor 310, a communication interface 320, a memory 330, and a communication bus 340, where the processor 310, the communication interface 320, and the memory 330 communicate with each other through the communication bus 340;
the memory 330 is used for storing computer programs;
the processor 310 is configured to implement the following steps when executing the computer program stored in the memory 330:
acquiring task set information of deep learning tasks to be processed and a GPU cluster, wherein the task set information of the GPU cluster comprises task information of each task in a waiting queue of each computing node of the GPU cluster and task information of each subtask being executed by each GPU in each computing node of the GPU cluster;
analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each computing node of the GPU cluster to respectively obtain the similarity between the deep learning task to be processed and each computing node of the GPU cluster;
determining target computing nodes of the deep learning task to be processed in the GPU cluster according to the similarity;
dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed;
analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to respectively obtain the interference level and the communication cost of each target subtask;
and respectively determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level of each target subtask and the communication cost.
For example, the processor 310 of the electronic device includes a central control unit and a GPU cluster composed of a plurality of GPUs, where the GPU cluster includes a plurality of compute nodes and each compute node is composed of a plurality of GPUs. The central control unit includes a data collector, a cluster management unit, and a node management unit, and the electronic device is configured to process a plurality of deep learning tasks in parallel in the GPU cluster. The data collector acquires the deep learning task to be processed and the task set information of the GPU cluster. The cluster management unit analyzes the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each computing node of the GPU cluster to obtain the similarity between the pending task and each computing node, and determines the target computing node of the pending task in the GPU cluster according to the similarity. The node management unit divides the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs the task requires, then analyzes the task information of the target subtasks and the task information of each subtask being executed by each GPU in the target computing node to obtain the interference level and communication cost of each target subtask, and finally determines the execution GPU of each target subtask among the GPUs of the target computing node according to the interference levels and communication costs.
Optionally, when the processor 310 is configured to execute the program stored in the memory 330, any of the GPU cluster deep learning task parallelization methods described above may also be implemented.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In an embodiment of the present application, a computer-readable storage medium is further provided, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, the computer is caused to execute any of the GPU cluster deep learning task parallelization methods in the foregoing embodiments.
It should be noted that, in this document, the technical features of the various alternatives may be combined into further schemes as long as they are not contradictory, and such schemes fall within the scope of the disclosure of the present application. Relational terms such as "first" and "second" are used only to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner, and for identical or similar parts among the embodiments, reference may be made to one another; each embodiment focuses on its differences from the other embodiments. In particular, since the embodiments of the apparatus, the electronic device, and the storage medium are substantially similar to the method embodiments, their description is relatively brief, and reference may be made to the corresponding parts of the description of the method embodiments.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (9)

1. A GPU cluster deep learning task parallelization method is characterized by comprising the following steps:
acquiring a deep learning task to be processed and task set information of a GPU cluster, wherein the task set information of the GPU cluster comprises task information of each task in the waiting queue of each computing node of the GPU cluster and task information of each subtask being executed by each GPU in each computing node of the GPU cluster;
analyzing task information of the deep learning task to be processed and task information of each task in a waiting queue of each computing node of the GPU cluster to respectively obtain the similarity of the deep learning task to be processed and each computing node of the GPU cluster;
determining target computing nodes of the deep learning task to be processed in the GPU cluster according to the similarity;
dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed;
analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to respectively obtain the interference level and the communication cost of each target subtask;
respectively determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level and the communication cost of each target subtask;
the analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to obtain the interference level and the communication cost of each target subtask respectively includes:
mapping the plurality of target subtasks to each GPU in the target computing node respectively to obtain the mapping relation between the target subtasks and each GPU;
respectively inputting the target subtask and each subtask being executed by each GPU into a preset performance prediction model, and respectively obtaining the performance degradation when the target subtask and each subtask being executed by each GPU are executed together;
respectively calculating the interference level of the target subtask according to each performance degradation;
respectively calculating the communication cost of the target subtask according to the available bandwidth among the GPUs and the data volume required for updating the model;
wherein the average performance degradation of the target subtask $T_i^j$ when assigned to one GPU is calculated using the following formula:

$$\mathrm{slowdown}_{jk} = \frac{1}{\mathrm{num}_k}\sum_{t=1}^{\mathrm{num}_k} z_t$$

wherein $\mathrm{slowdown}_{jk}$ represents the average performance degradation of the target subtask $T_i^j$ when assigned to the k-th GPU in the target computing node, $\mathrm{num}_k$ represents the number of subtasks being executed together on the k-th GPU in the target computing node, and each element $z_t$ of the vector $z$ represents the performance degradation of the target subtask when executed together with one of those subtasks; the performance degradation is the ratio of the completion time of the target subtask when run together with the other subtasks to the completion time of the target subtask when run alone;
the interference level is defined as

$$I(M) = \frac{1}{n}\sum_{j=1}^{n} I(M(j)), \qquad I(M(j)) = \mathrm{slowdown}_{j,M(j)}$$

wherein $I(M(j))$ represents the interference level of the j-th target subtask $T_i^j$ under its mapping relation $M(j)$, and n represents that the deep learning task to be processed is divided into n target subtasks.
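To make the two formulas of claim 1 concrete, the sketch below computes the per-GPU average slowdown and the interference level $I(M)$ of a complete mapping; the preset performance prediction model is replaced by a lookup table of assumed slowdown values, which is an illustrative stand-in only.

```python
# Sketch of the slowdown and interference-level formulas of claim 1;
# the prediction model is stubbed out with assumed values.
import numpy as np

def slowdown_on_gpu(z):
    # z: predicted slowdowns of the target subtask against each of the
    # num_k subtasks already running on GPU k; slowdown is co-run
    # completion time over solo completion time, so 1.0 means no
    # interference. An idle GPU contributes none.
    return float(np.mean(z)) if len(z) else 1.0

def interference_level(mapping, predicted):
    # mapping[j] = GPU index M(j) chosen for target subtask j;
    # predicted[(j, k)] = slowdown vector z of subtask j on GPU k;
    # I(M) = (1/n) * sum_j slowdown_{j, M(j)}.
    n = len(mapping)
    return sum(slowdown_on_gpu(predicted[(j, g)]) for j, g in mapping.items()) / n

# Toy example: two target subtasks, two GPUs, GPU 0 the busier one.
predicted = {(0, 0): [1.4, 1.2], (0, 1): [1.1],
             (1, 0): [1.3, 1.5], (1, 1): [1.0]}
print(interference_level({0: 1, 1: 1}, predicted))  # (1.1 + 1.0) / 2 = 1.05
```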
2. The method of claim 1, wherein the task information of the deep learning task to be processed comprises: the maximum CPU utilization rate of the deep learning task to be processed, the host memory utilization rate of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization rate of the deep learning task to be processed, the device memory utilization rate of the deep learning task to be processed, the bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of a data set of the deep learning task to be processed.
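As an illustration of claim 2, the sketch below packs the enumerated metrics into a fixed-order feature vector of the kind the similarity computation consumes; the field names and the normalization scales are assumptions, since the claim only lists the metrics.

```python
# Sketch: building a task feature vector from the metrics of claim 2.
# Field names and normalization scales are illustrative assumptions.
import numpy as np

FEATURES = ["max_cpu_util", "host_mem_util", "io_throughput", "gpu_util",
            "device_mem_util", "bandwidth_util", "samples_per_step",
            "dataset_size"]

def task_feature_vector(info, scale):
    # Normalize each metric by a cluster-wide scale so magnitudes are
    # comparable before cosine similarity is applied.
    return np.array([info[f] / scale[f] for f in FEATURES])

info = {"max_cpu_util": 0.75, "host_mem_util": 0.40, "io_throughput": 120.0,
        "gpu_util": 0.90, "device_mem_util": 0.65, "bandwidth_util": 0.30,
        "samples_per_step": 64, "dataset_size": 1.5e9}
scale = {"max_cpu_util": 1.0, "host_mem_util": 1.0, "io_throughput": 1e3,
         "gpu_util": 1.0, "device_mem_util": 1.0, "bandwidth_util": 1.0,
         "samples_per_step": 1024, "dataset_size": 1e10}
print(task_feature_vector(info, scale))
```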
3. The method according to claim 1, wherein the analyzing task information of the deep learning task to be processed and task information of each task in a waiting queue of each compute node of the GPU cluster to obtain similarity between the deep learning task to be processed and each compute node of the GPU cluster respectively comprises:
inputting the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each computing node of the GPU cluster into a preset similarity prediction model to respectively obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each computing node of the GPU cluster;
respectively calculating the similarity between the deep learning task to be processed and each task in each calculation node waiting queue of the GPU cluster according to the following formula:
$$\cos\big(\theta_{J_i S_{jk}}\big) = \frac{J_i \cdot S_{jk}}{\|J_i\|\,\|S_{jk}\|}$$

wherein $J_i$ represents the feature vector of the i-th deep learning task to be processed; $S_j$ represents the tasks in the waiting queue of the j-th computing node in the GPU cluster, wherein $S_j$ comprises n tasks; $S_{jk}$ represents the feature vector of the k-th task in the waiting queue of the j-th computing node in the GPU cluster, with k ∈ {1, ..., n}; $\theta_{J_i S_{jk}}$ is the angle between $J_i$ and $S_{jk}$; $\cos(\theta_{J_i S_{jk}})$ represents the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th computing node in the GPU cluster; $\|J_i\|$ is the modulus of the vector $J_i$ and $\|S_{jk}\|$ is the modulus of the vector $S_{jk}$;
and obtaining the similarity between the deep learning task to be processed and each computing node of the GPU cluster according to the similarity between the deep learning task to be processed and each task in each computing node waiting queue of the GPU cluster.
4. The method of claim 1, wherein the determining, in the GPU of the target compute node, the execution GPU for each of the target subtasks based on the interference level and the communication cost for each of the target subtasks comprises:
determining an objective function of the target GPU, wherein the objective function is:

$$M^{*} = \arg\min_{M}\big[\alpha\, I(M) + \beta\, C(M)\big]$$

wherein the bracketed term is the objective function value, $I(M)$ represents the interference level when the mapping relation is M, $C(M)$ represents the communication cost when the mapping relation is M, α is the weight of $I(M)$, β is the weight of $C(M)$, and α + β = 1;

when the objective function value reaches its minimum at the mapping relation $M^{*}$, the GPUs corresponding to $M^{*}$ are the target GPUs.
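A brute-force reading of the claim-4 objective is sketched below: every candidate mapping M is scored by $\alpha I(M) + \beta C(M)$ and the minimizer $M^{*}$ is kept. The concrete communication-cost model (model-update volume divided by available pairwise bandwidth, zero for co-located subtasks) follows the wording of claim 1, but its exact form here is an assumption.

```python
# Exhaustive search for M* = argmin_M [alpha*I(M) + beta*C(M)].
from itertools import product

def comm_cost(mapping, volume, bandwidth):
    # Each pair of subtasks placed on different GPUs exchanges model
    # updates of size `volume` over that pair's available bandwidth.
    gpus = list(mapping.values())
    cost = 0.0
    for a in range(len(gpus)):
        for b in range(a + 1, len(gpus)):
            if gpus[a] != gpus[b]:
                cost += volume / bandwidth[(gpus[a], gpus[b])]
    return cost

def best_mapping(n, gpu_ids, interference, volume, bandwidth,
                 alpha=0.5, beta=0.5):
    # Enumerate all len(gpu_ids)**n mappings and keep the minimizer.
    best, best_score = None, float("inf")
    for assignment in product(gpu_ids, repeat=n):
        m = dict(enumerate(assignment))
        score = alpha * interference(m) + beta * comm_cost(m, volume, bandwidth)
        if score < best_score:
            best, best_score = m, score
    return best, best_score

# Toy example: GPU 0 is loaded (slowdown 1.4), GPU 1 idle (slowdown 1.0).
slow = {0: 1.4, 1: 1.0}
interf = lambda m: sum(slow[g] for g in m.values()) / len(m)
bw = {(0, 1): 10.0, (1, 0): 10.0}  # available bandwidth between the pair
print(best_mapping(2, [0, 1], interf, volume=2.0, bandwidth=bw))
# -> ({0: 1, 1: 1}, 0.5): both subtasks land on the idle GPU in this toy case
```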
5. The method according to any one of claims 1 to 4, wherein the determining, in the GPU of the target computing node, the execution GPU of each of the target subtasks according to the interference level and the communication cost of each of the target subtasks respectively comprises:
and determining a target GPU of the target subtask in the target computing node according to a preset optimization algorithm, the interference level and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
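Because the exhaustive search above grows as the number of GPUs raised to the number of subtasks, claim 5 allows a preset optimization algorithm instead; the following minimal simulated-annealing sketch searches the same mapping space, with a cooling schedule and move rule that are illustrative choices rather than the patent's.

```python
# Minimal simulated annealing over subtask-to-GPU mappings; the cooling
# schedule, move rule, and constants are illustrative assumptions.
import math
import random

def anneal(n_subtasks, gpu_ids, score, t0=1.0, cooling=0.95, steps=500):
    random.seed(0)  # deterministic for the example
    current = [random.choice(gpu_ids) for _ in range(n_subtasks)]
    best, t = list(current), t0
    for _ in range(steps):
        # Move rule: reassign one random subtask to a random GPU.
        cand = list(current)
        cand[random.randrange(n_subtasks)] = random.choice(gpu_ids)
        delta = score(cand) - score(current)
        # Always accept improvements; accept regressions with
        # Boltzmann probability exp(-delta / t).
        if delta <= 0 or random.random() < math.exp(-delta / t):
            current = cand
        if score(current) < score(best):
            best = list(current)
        t *= cooling
    return best

# Toy usage with an interference-only objective.
slow = {0: 1.4, 1: 1.0}
score = lambda m: sum(slow[g] for g in m) / len(m)
print(anneal(4, [0, 1], score))  # tends toward [1, 1, 1, 1]
```

The ant colony, genetic, or particle swarm variants named in the claim would plug into the same objective in place of `score`.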
6. A GPU cluster deep learning task parallelization device is characterized by comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring task set information of deep learning tasks to be processed and GPU clusters, and the task set information of the GPU clusters comprises task information of each task in a waiting queue of each computing node of the GPU clusters and task information of each subtask being executed by each GPU in each computing node of the GPU clusters;
the first analysis module is used for analyzing the task information of the deep learning task to be processed and the task information of each task in each computing node waiting queue of the GPU cluster to respectively obtain the similarity between the deep learning task to be processed and each computing node of the GPU cluster;
the calculation node determination module is used for determining a target calculation node of the deep learning task to be processed in the GPU cluster according to the similarity;
the subtask module is used for dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed;
the second analysis module is used for analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to respectively obtain the interference level and the communication cost of each target subtask;
a GPU determining module, configured to determine, in the GPU of the target computing node, a GPU for executing each of the target subtasks, according to the interference level and the communication cost of each of the target subtasks;
the second analysis module is specifically configured to map the multiple target subtasks to each GPU in the target computing node, so as to obtain a mapping relationship between each target subtask and each GPU; respectively inputting the target subtask and each subtask being executed by each GPU into a preset performance prediction model, and respectively obtaining the performance degradation when the target subtask and each subtask being executed by each GPU are executed together; respectively calculating the interference level of the target subtask according to each performance degradation; respectively calculating the communication cost of the target subtask according to the available bandwidth among the GPUs and the data volume required by the updated model;
wherein the average performance degradation of the target subtask $T_i^j$ when assigned to one GPU is calculated using the following formula:

$$\mathrm{slowdown}_{jk} = \frac{1}{\mathrm{num}_k}\sum_{t=1}^{\mathrm{num}_k} z_t$$

wherein $\mathrm{slowdown}_{jk}$ represents the average performance degradation of the target subtask $T_i^j$ when assigned to the k-th GPU in the target computing node, $\mathrm{num}_k$ represents the number of subtasks being executed together on the k-th GPU in the target computing node, and each element $z_t$ of the vector $z$ represents the performance degradation of the target subtask when executed together with one of those subtasks; the performance degradation is the ratio of the completion time of the target subtask when run together with the other subtasks to the completion time of the target subtask when run alone;
the interference level is defined as

$$I(M) = \frac{1}{n}\sum_{j=1}^{n} I(M(j)), \qquad I(M(j)) = \mathrm{slowdown}_{j,M(j)}$$

wherein $I(M(j))$ represents the interference level of the j-th target subtask $T_i^j$ under its mapping relation $M(j)$, and n represents that the deep learning task to be processed is divided into n target subtasks.
7. The apparatus of claim 6, wherein the task information of the deep learning task to be processed comprises: the maximum CPU utilization rate of the deep learning task to be processed, the host memory utilization rate of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization rate of the deep learning task to be processed, the device memory utilization rate of the deep learning task to be processed, the bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
8. An electronic device, comprising: a processor, a communication interface, a memory, and a communication bus, wherein,
the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the GPU cluster deep learning task parallelization method of any one of claims 1-5 when executing the program stored on the memory.
9. A storage medium having stored therein a computer program which, when executed by a processor, implements the GPU cluster deep learning task parallelization method of any of claims 1-5.
CN201910675587.9A 2019-07-25 2019-07-25 GPU cluster deep learning task parallelization method and device and electronic equipment Active CN110399222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910675587.9A CN110399222B (en) 2019-07-25 2019-07-25 GPU cluster deep learning task parallelization method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110399222A CN110399222A (en) 2019-11-01
CN110399222B true CN110399222B (en) 2022-01-21

Family

ID=68325235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910675587.9A Active CN110399222B (en) 2019-07-25 2019-07-25 GPU cluster deep learning task parallelization method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110399222B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112965809A (en) * 2019-12-12 2021-06-15 深圳市优必选科技股份有限公司 Deep learning task processing system and method
CN111104289B (en) * 2019-12-25 2023-03-14 创新奇智(上海)科技有限公司 System and method for checking efficiency of GPU (graphics processing Unit) cluster
CN111258735A (en) * 2020-01-16 2020-06-09 中国人民解放军国防科技大学 Deep learning task scheduling method supporting QoS (quality of service) perception of user
CN111309479B (en) 2020-02-14 2023-06-06 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing task parallel processing
CN111866187B (en) * 2020-06-30 2022-10-04 中科院计算所西部高等技术研究院 Task scheduling method for distributed deep learning reasoning cloud platform
CN111913799B (en) * 2020-07-14 2024-04-19 北京华夏启信科技有限公司 Video stream online analysis task scheduling method and computer equipment
CN112416585B (en) * 2020-11-20 2024-03-15 南京大学 Deep learning-oriented GPU resource management and intelligent scheduling method
CN112584143B (en) * 2020-12-02 2022-09-06 浙江大华技术股份有限公司 Video coding method, device and system and computer readable storage medium
WO2022116142A1 (en) * 2020-12-04 2022-06-09 深圳大学 Resource scheduling method based on graph neural network
CN113194086B (en) * 2021-04-27 2022-05-27 新华三信息安全技术有限公司 Anti-attack method and device
CN113377520B (en) * 2021-07-07 2023-03-24 北京百度网讯科技有限公司 Resource scheduling method, device, equipment and storage medium
CN113900793B (en) * 2021-07-29 2023-11-10 苏州浪潮智能科技有限公司 Server cluster and deep learning aggregate communication system and method thereof
CN114285766B (en) * 2021-08-20 2023-06-13 腾讯科技(深圳)有限公司 Network bandwidth detection method and device, electronic equipment and storage medium
CN114116220A (en) * 2021-11-29 2022-03-01 苏州浪潮智能科技有限公司 GPU (graphics processing Unit) sharing control method, GPU sharing control device and storage medium
CN114138449A (en) * 2021-12-14 2022-03-04 河南省儿童医院郑州儿童医院 Rehabilitation training system based on virtual reality
CN117521841A (en) * 2022-07-28 2024-02-06 华为技术有限公司 Deep learning system and method
CN115248728B (en) * 2022-09-21 2023-02-03 之江实验室 Distributed training task scheduling method, system and device for intelligent computing
CN115373861B (en) * 2022-10-26 2022-12-27 小米汽车科技有限公司 GPU resource scheduling method and device, electronic equipment and storage medium


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7219085B2 (en) * 2003-12-09 2007-05-15 Microsoft Corporation System and method for accelerating and optimizing the processing of machine learning techniques using a graphics processing unit
US8707300B2 (en) * 2010-07-26 2014-04-22 Microsoft Corporation Workload interference estimation and performance optimization
US9058217B2 (en) * 2012-09-14 2015-06-16 International Business Machines Corporation Preferential CPU utilization for tasks
WO2016078008A1 (en) * 2014-11-19 2016-05-26 华为技术有限公司 Method and apparatus for scheduling data flow task
CN107329828B (en) * 2017-06-26 2019-10-08 华中科技大学 A kind of data flow programmed method and system towards CPU/GPU isomeric group
CN109936604B (en) * 2017-12-18 2022-07-26 北京图森智途科技有限公司 Resource scheduling method, device and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045456A (en) * 2016-02-05 2017-08-15 华为技术有限公司 A kind of resource allocation methods and explorer
CN107135257A (en) * 2017-04-28 2017-09-05 东方网力科技股份有限公司 Task is distributed in a kind of node cluster method, node and system
CN107766148A (en) * 2017-08-31 2018-03-06 北京百度网讯科技有限公司 A kind of isomeric group and task processing method and device
CN109101339A (en) * 2018-08-15 2018-12-28 北京邮电大学 Video task parallel method, device and Heterogeneous Cluster Environment in isomeric group

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Approximate Similarity Search for Online Multimedia Services on Distributed CPU-GPU Platforms";George Teodoro et.al;《arXiv》;20120903;全文 *
"DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment";Wei QIAO et.al;《ITM Web of Conferences》;20171231;全文 *
"Learning Driven Parallelization for Large-Scale Video Workload in Hybrid CPU-GPU Cluster";Haitao Zhang et.al;《ICPP 2018》;20180816;全文 *
"Multi-tenant GPU clusters for deep learning workloads";Jeon, M. et.al;《Technical report, MSR-TR-2018》;20181231;全文 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant