CN110399222A - GPU cluster deep learning task parallel method, device and electronic equipment - Google Patents
GPU cluster deep learning task parallelization method, device and electronic equipment
- Publication number
- CN110399222A CN110399222A CN201910675587.9A CN201910675587A CN110399222A CN 110399222 A CN110399222 A CN 110399222A CN 201910675587 A CN201910675587 A CN 201910675587A CN 110399222 A CN110399222 A CN 110399222A
- Authority
- CN
- China
- Prior art keywords
- gpu
- deep learning
- processed
- task
- learning task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Debugging And Monitoring (AREA)
- Computer And Data Communications (AREA)
Abstract
The embodiments of this application provide a GPU cluster deep learning task parallelization method, a device, and electronic equipment, relating to the field of Internet technology. The method first analyzes the similarity between a pending deep learning task and each compute node of the GPU cluster to determine the task's target compute node, which reduces the likelihood of resource contention on a compute node and thereby improves system resource utilization and the execution efficiency of deep learning tasks. It then divides the pending task into multiple target subtasks according to the number of GPUs the task requires, and analyzes the interference level and communication cost of each target subtask to determine its target GPU within the target compute node. This avoids unbalanced resource allocation across the GPUs of a compute node, achieves a high degree of parallelism for deep learning tasks, and improves both the resource utilization of the GPU cluster and the execution efficiency of deep learning tasks.
Description
Technical field
This application relates to the field of Internet technology, and in particular to a GPU cluster deep learning task parallelization method, a device, and electronic equipment.
Background technique
As deep learning research continues to deepen, deep learning technology has achieved great success in fields such as computer vision, speech recognition, and text processing, bringing great convenience to people's lives. However, complex neural network models and massive amounts of data place ever more stringent demands on computing capability. A GPU (Graphics Processing Unit) cluster integrates multiple GPU computing resources, providing powerful and efficient parallel computing capability for computation-intensive deep learning tasks and effectively meeting the computing demands of multiple deep learning tasks.
However, when deep learning tasks run on a resource-shared GPU cloud platform, their execution efficiency is affected by interference caused by resource contention among concurrently executing tasks. For deep learning tasks in a GPU cluster, it is therefore essential to schedule tasks in parallel according to the task information and the tasks' resource demands, making reasonable use of the nodes in the cluster and of the multiple GPU resources on each node, so as to optimize the execution time of deep learning tasks, improve the processing performance of the overall computing workload, and raise the resource utilization of the system.
Mainstream GPU clusters currently rely on traditional schedulers (e.g., Kubernetes, YARN) to schedule deep learning tasks. These schedulers track overall resource usage, allocate resources to GPU consumers accordingly, and ensure that enough resources remain available over a GPU's lifecycle to keep its tasks running.
Although this approach achieves task parallelism to some extent, it mainly considers resource usage and ignores both the physical characteristics of the resources and the characteristics of the tasks themselves. It therefore cannot achieve efficient parallelization of deep learning tasks and can reduce the execution efficiency of deep learning workloads. Moreover, it does not support fine-grained multi-task allocation on a GPU, so the GPU resources on a node cannot be fully utilized; this hinders efficient task execution, lowers per-node GPU utilization, and ultimately reduces the resource utilization of the GPU cluster.
Summary of the invention
The purpose of the embodiments of this application is to provide a GPU cluster deep learning task parallelization method, a device, electronic equipment, a storage medium, and a computer program product comprising instructions, which achieve a high degree of parallelism for deep learning tasks and improve both the utilization of GPU resources in the cluster and the execution efficiency of deep learning tasks.
The specific technical solution is as follows:
In a first aspect, an embodiment of this application provides a GPU cluster deep learning task parallelization method, comprising:
obtaining a pending deep learning task and task-set information of the GPU cluster, wherein the task-set information of the GPU cluster includes the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask currently executing on each GPU in each compute node of the GPU cluster;

analyzing the task information of the pending deep learning task and the task information of each task in the waiting queue of each compute node of the GPU cluster, respectively obtaining the similarity between the pending deep learning task and each compute node of the GPU cluster;

determining, according to the similarity, the target compute node of the pending deep learning task in the GPU cluster;

dividing the pending deep learning task into multiple target subtasks according to the number of GPUs the pending deep learning task requires;

analyzing the task information of each target subtask and the task information of each subtask currently executing on each GPU of the target compute node, respectively obtaining the interference level and the communication cost of each target subtask;

determining, according to the interference level and the communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
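As an illustration only (none of the following code appears in the application), the six steps of the first aspect can be sketched as a two-stage placement routine. All names are hypothetical, and the per-node similarities and per-GPU costs are assumed to come from the prediction models described below:

```python
def schedule(task_gpu_count, node_sims, gpu_costs):
    """Sketch of the two-stage placement described in the first aspect.
    node_sims:  {node_id: similarity of the pending task to that node's queue}
    gpu_costs:  {node_id: {gpu_id: [combined cost for subtask 0, subtask 1, ...]}}
    Returns (target compute node, [execution GPU chosen for each subtask])."""
    # Steps 1-3: pick the compute node whose queued tasks are LEAST similar
    # to the pending task, minimizing the risk of resource contention.
    target = min(node_sims, key=node_sims.get)
    # Steps 4-6: split into task_gpu_count subtasks and give each one the
    # GPU with the lowest combined interference + communication cost.
    placement = [min(gpu_costs[target], key=lambda g: gpu_costs[target][g][i])
                 for i in range(task_gpu_count)]
    return target, placement
```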
Optionally, the task information of the pending deep learning task comprises: the maximum CPU usage of the pending deep learning task, its host memory utilization, its I/O throughput, its GPU utilization, its device memory utilization, its bandwidth utilization, the number of samples it analyzes in each step, and the size of its data set.
Optionally, analyzing the task information of the pending deep learning task and the task information of each task in the waiting queue of each compute node of the GPU cluster to respectively obtain the similarity between the pending deep learning task and each compute node of the GPU cluster comprises:

inputting the task information of the pending deep learning task and the task information of each task in each compute node's waiting queue into a preset similarity prediction model, respectively obtaining the feature vector of the pending deep learning task and the feature vector of each task in each compute node's waiting queue;

calculating the similarity between the pending deep learning task and each task in each compute node's waiting queue according to the following formula:

cos(θJiSjk) = (Ji · Sjk) / (‖Ji‖ ‖Sjk‖)

wherein Ji is the feature vector of the i-th pending deep learning task; Sj is the set of feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster and contains n tasks; Sjk is the feature vector of the k-th task in that queue, k ∈ {1, …, n}; θJiSjk is the angle between Ji and Sjk; cos(θJiSjk) is the similarity between the i-th pending deep learning task and the k-th task in the waiting queue of the j-th compute node; and ‖Ji‖ and ‖Sjk‖ are the norms of Ji and Sjk;

obtaining the similarity between the pending deep learning task and each compute node of the GPU cluster from the similarities between the pending deep learning task and the tasks in that compute node's waiting queue.
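The cosine-similarity formula above can be sketched in plain Python as follows (an illustrative sketch, not code from the application):

```python
import math

def cosine_similarity(j_i, s_jk):
    """cos(theta) between the pending task's feature vector Ji and the
    feature vector Sjk of the k-th queued task on compute node j."""
    dot = sum(a * b for a, b in zip(j_i, s_jk))
    norm_j = math.sqrt(sum(a * a for a in j_i))
    norm_s = math.sqrt(sum(b * b for b in s_jk))
    if norm_j == 0.0 or norm_s == 0.0:
        return 0.0  # degenerate vectors carry no similarity signal
    return dot / (norm_j * norm_s)
```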
Optionally, analyzing the task information of each target subtask and the task information of each subtask currently executing on each GPU of the target compute node to respectively obtain the interference level and the communication cost of each target subtask comprises:

mapping the multiple target subtasks onto the GPUs of the target compute node, obtaining the mapping relations between the target subtasks and the GPUs;

inputting each target subtask, together with the subtasks currently executing on each GPU, into a preset performance prediction model, respectively obtaining the performance degradation when the target subtask co-executes with the subtasks running on each GPU;

calculating the interference level of each target subtask from these performance degradations;

calculating the communication cost of each target subtask from the available bandwidth between the GPUs and the amount of data required to update the model.
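A minimal sketch of the two cost terms (illustrative only; the patent does not specify how the per-GPU performance degradations are aggregated, so taking their mean is an assumption, as is modeling communication cost as transfer time):

```python
def interference_level(slowdowns):
    """Aggregate the predicted performance degradations of co-executing a
    target subtask with every subtask already running on a GPU.
    (Mean aggregation is an assumption, not stated in the patent.)"""
    return sum(slowdowns) / len(slowdowns) if slowdowns else 0.0

def communication_cost(model_update_bytes, bandwidth_bytes_per_s):
    """Time to move one model update over the available inter-GPU
    bandwidth (cost-as-transfer-time is an assumption)."""
    return model_update_bytes / bandwidth_bytes_per_s
```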
Optionally, determining the execution GPU of each target subtask among the GPUs of the target compute node according to the interference level and the communication cost of each target subtask comprises:

determining the objective function of the target GPU, wherein the objective function is:

F(M) = α·I(M) + β·C(M)

where F(M) is the objective value, I(M) is the interference level under mapping relation M, C(M) is the communication cost under mapping relation M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;

when F(M) is minimal, the GPUs corresponding to mapping relation M are the target GPUs.
Optionally, determining the execution GPU of each target subtask among the GPUs of the target compute node according to the interference level and the communication cost comprises:

determining the target GPU of each target subtask within the target compute node according to a preset optimization algorithm, the interference level, and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, simulated annealing, or particle swarm optimization.
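Of the listed optimization algorithms, simulated annealing is the simplest to sketch. The following is an illustrative implementation over a discrete set of candidate mappings, not code from the application; the cooling schedule and step count are arbitrary choices:

```python
import math
import random

def anneal(mappings, cost, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated-annealing search over candidate subtask-to-GPU mappings.
    `cost` is the weighted interference + communication objective F(M)."""
    rng = random.Random(seed)
    current = rng.choice(mappings)
    best, t = current, t0
    for _ in range(steps):
        candidate = rng.choice(mappings)
        delta = cost(candidate) - cost(current)
        # Accept better mappings always, worse ones with Boltzmann probability.
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current = candidate
        if cost(current) < cost(best):
            best = current
        t *= cooling  # geometric cooling schedule
    return best
```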
In a second aspect, an embodiment of this application provides a GPU cluster deep learning task parallelization device, comprising:

an acquisition module, configured to obtain a pending deep learning task and task-set information of the GPU cluster, wherein the task-set information includes the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask currently executing on each GPU in each compute node;

a first analysis module, configured to analyze the task information of the pending deep learning task and the task information of each task in the waiting queue of each compute node, respectively obtaining the similarity between the pending deep learning task and each compute node of the GPU cluster;

a compute node determining module, configured to determine, according to the similarity, the target compute node of the pending deep learning task in the GPU cluster;

a subtask module, configured to divide the pending deep learning task into multiple target subtasks according to the number of GPUs the pending deep learning task requires;

a second analysis module, configured to analyze the task information of each target subtask and the task information of each subtask currently executing on each GPU of the target compute node, respectively obtaining the interference level and the communication cost of each target subtask;

a GPU determining module, configured to determine, according to the interference level and the communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
Optionally, the task information of the pending deep learning task comprises at least one of: the maximum CPU usage of the pending deep learning task, its host memory utilization, its I/O throughput, its GPU utilization, its device memory utilization, its bandwidth utilization, the number of samples it analyzes in each step, and the size of its data set.
Optionally, the first analysis module is specifically configured to:

input the task information of the pending deep learning task and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, respectively obtaining the feature vector of the pending deep learning task and the feature vector of each task in each compute node's waiting queue;

calculate the similarity between the pending deep learning task and each task in each compute node's waiting queue according to the following formula:

cos(θJiSjk) = (Ji · Sjk) / (‖Ji‖ ‖Sjk‖)

wherein Ji is the feature vector of the i-th pending deep learning task; Sj is the set of feature vectors of the tasks in the waiting queue of the j-th compute node and contains n tasks; Sjk is the feature vector of the k-th task in that queue, k ∈ {1, …, n}; θJiSjk is the angle between Ji and Sjk; cos(θJiSjk) is the similarity between the i-th pending deep learning task and the k-th task in the waiting queue of the j-th compute node; and ‖Ji‖ and ‖Sjk‖ are the norms of Ji and Sjk;

obtain the similarity between the pending deep learning task and each compute node of the GPU cluster from the similarities between the pending deep learning task and the tasks in that compute node's waiting queue.

Optionally, the second analysis module is specifically configured to:
map the multiple target subtasks onto the GPUs of the target compute node, obtaining the mapping relations between the target subtasks and the GPUs;

input each target subtask, together with the subtasks currently executing on each GPU, into a preset performance prediction model, respectively obtaining the performance degradation when the target subtask co-executes with the subtasks running on each GPU;

calculate the interference level of each target subtask from these performance degradations;

calculate the communication cost of each target subtask from the available bandwidth between the GPUs and the amount of data required to update the model.
Optionally, the second analysis module is specifically configured to:

determine the objective function of the target GPU, wherein the objective function is:

F(M) = α·I(M) + β·C(M)

where F(M) is the objective value, I(M) is the interference level under mapping relation M, C(M) is the communication cost under mapping relation M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;

when F(M) is minimal, the GPUs corresponding to mapping relation M are the target GPUs.
Optionally, the second analysis module is specifically configured to:

determine the target GPU of each target subtask within the target compute node according to a preset optimization algorithm, the interference level, and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, simulated annealing, or particle swarm optimization.
In a third aspect, an embodiment of this application provides electronic equipment, comprising a processor, a communication interface, a memory, and a communication bus, wherein:

the processor, the communication interface, and the memory communicate with one another via the communication bus;

the memory is configured to store a computer program;

the processor, when executing the program stored in the memory, implements the GPU cluster deep learning task parallelization method of any implementation of the first aspect.
In a fourth aspect, an embodiment of this application provides a storage medium storing instructions which, when run on a computer, cause the computer to execute the GPU cluster deep learning task parallelization method of any implementation of the first aspect.

In a fifth aspect, an embodiment of this application provides a computer program product comprising instructions which, when run on a computer, cause the computer to execute the GPU cluster deep learning task parallelization method of any implementation of the first aspect.
With the GPU cluster deep learning task parallelization method, device, electronic equipment, storage medium, and computer program product comprising instructions provided by the embodiments of this application, the similarity between a pending deep learning task and each compute node of the GPU cluster is analyzed first to determine the task's target compute node. By fully considering the similarity between the pending deep learning task and other tasks, the likelihood of resource contention on a compute node is reduced, improving system resource utilization and the execution efficiency of deep learning tasks. The pending task is then divided into multiple target subtasks according to the number of GPUs it requires, and the interference level and communication cost of each target subtask are analyzed to determine its target GPU within the target compute node. Considering interference and communication cost together avoids unbalanced resource allocation across the GPUs of a compute node, achieves a high degree of parallelism for deep learning tasks, improves the resource utilization of the GPU cluster, and improves the execution efficiency of deep learning tasks. Of course, a product or method implementing this application need not achieve all of the above advantages simultaneously.
Detailed description of the invention
To explain the technical solutions of the embodiments of this application or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a GPU cluster deep learning task parallelization method according to an embodiment of this application;

Fig. 2 is a schematic diagram of a GPU cluster deep learning task parallelization device according to an embodiment of this application;

Fig. 3 is a schematic diagram of electronic equipment according to an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of this application without creative effort fall within the protection scope of this application.
The embodiments of this application disclose a GPU cluster deep learning task parallelization method, a device, electronic equipment, a storage medium, and a computer program product comprising instructions; each is described below.
An embodiment of this application provides a GPU cluster deep learning task parallelization method. Referring to Fig. 1, which is a schematic diagram of the method, it comprises the following steps:

Step 110: obtain a pending deep learning task and the task-set information of the GPU cluster, where the task-set information includes the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask currently executing on each GPU in each compute node.
The GPU cluster deep learning task parallelization method of this embodiment can be implemented by electronic equipment; specifically, the electronic equipment can be a server.

To improve GPU computing performance, GPUs can be scaled out into a GPU cluster: multiple GPUs on multiple nodes form a cluster that integrates them to complete complex computing tasks. When large-scale deep learning training runs on a GPU cluster, there can be multiple deep learning tasks in the cluster, and the pending deep learning task is any one of them. Suppose the system has p pending deep learning tasks, and denote a pending deep learning task as Ji, i ∈ {1, …, p}. The GPU cluster contains a task set, which includes each task in the waiting queue of each compute node and each subtask currently executing on each GPU of each compute node.
Step 120: analyze the task information of the pending deep learning task and the task information of each task in the waiting queue of each compute node of the GPU cluster, respectively obtaining the similarity between the pending deep learning task and each compute node.

To improve the execution efficiency of the GPU cluster and the utilization of its computing resources, the similarity between the pending deep learning task and each compute node must be computed, so that interference between computing resources can be avoided.
In one possible implementation, the task information of the pending deep learning task comprises: the maximum CPU usage of the pending deep learning task, its host memory utilization, its I/O throughput, its GPU utilization, its device memory utilization, its bandwidth utilization, the number of samples it analyzes in each step, and the size of its data set.
In this embodiment, the pending deep learning task is characterized as Ji = (jUCPU, jUhMem, jThPI/O, jUGPU, jUdMem, jThPPCIe, batch_size, dsize), where jUCPU is the task's maximum CPU usage, jUhMem its host memory utilization, jThPI/O its I/O throughput, jUGPU its GPU utilization, jUdMem its device memory utilization, jThPPCIe its PCIe bandwidth utilization, batch_size the number of samples analyzed in each training step, and dsize the size of its data set.
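The eight-field characterization Ji above can be represented directly; a sketch (the field names are hypothetical English renderings of the symbols above, not names from the patent):

```python
from dataclasses import dataclass, astuple

@dataclass
class TaskProfile:
    """Eight-dimensional characterization Ji of a pending deep learning task."""
    cpu_util_max: float         # jUCPU  - maximum CPU usage
    host_mem_util: float        # jUhMem - host memory utilization
    io_throughput: float        # jThPI/O - I/O throughput
    gpu_util: float             # jUGPU  - GPU utilization
    device_mem_util: float      # jUdMem - device memory utilization
    pcie_bandwidth_util: float  # jThPPCIe - PCIe bandwidth utilization
    batch_size: int             # samples analyzed per training step
    dataset_size: float         # dsize - size of the data set

def feature_vector(p: TaskProfile):
    """Flatten the profile into the feature vector Ji used for similarity."""
    return list(astuple(p))
```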
In one possible implementation, analyzing the task information of the pending deep learning task and the task information of each task in the waiting queue of each compute node of the GPU cluster to respectively obtain the similarity between the pending deep learning task and each compute node comprises:

inputting the task information of the pending deep learning task and the task information of each task in each compute node's waiting queue into a preset similarity prediction model, respectively obtaining the feature vector of the pending deep learning task and the feature vector of each task in each compute node's waiting queue;

calculating the similarity between the pending deep learning task and each task in each compute node's waiting queue according to the following formula:

cos(θJiSjk) = (Ji · Sjk) / (‖Ji‖ ‖Sjk‖)

wherein Ji is the feature vector of the i-th pending deep learning task; Sj is the set of feature vectors of the tasks in the waiting queue of the j-th compute node and contains n tasks; Sjk is the feature vector of the k-th task in that queue, k ∈ {1, …, n}; θJiSjk is the angle between Ji and Sjk; cos(θJiSjk) is the similarity between the i-th pending deep learning task and the k-th task in the waiting queue of the j-th compute node; and ‖Ji‖ and ‖Sjk‖ are the norms of Ji and Sjk;

obtaining the similarity between the pending deep learning task and each compute node of the GPU cluster from the similarities between the pending deep learning task and the tasks in that compute node's waiting queue.
The preset similarity prediction model can be an IASP (Interference-Aware Similarity Prediction) model.
For example, define a pending-task metric matrix MJ to characterize the pending deep learning tasks, where each row of MJ represents one pending deep learning task and each column represents one of its performance features; that is, the matrix is built from CPU resource utilization and GPU resource utilization metrics.

Obtain the task information of a pending deep learning task Ji, standardize it, and fill it into the corresponding row of MJ. Each pending deep learning task Ji in the queue is profiled with a small amount of data, arbitrarily selecting two features from its feature set to obtain their values. CPU resource utilization is analyzed through the virtual file system (the proc file system) to obtain CPU metrics, and GPU resource utilization is analyzed with a performance profiling tool, such as the NVIDIA Profiler, to obtain GPU metrics. The measured performance metrics are filled into the characterization vector Ji of the pending task, and Ji is inserted into MJ.
The IASP model is used to predict the missing values in MJ, and the predicted features are filled back into MJ to obtain a complete matrix; the feature vector of the pending deep learning task is then obtained from the IASP model. Similarly, the task information of the tasks in each compute node's waiting queue is input into the IASP model to obtain the feature vectors of those tasks.

Specifically, the IASP model performs similarity prediction as a deep collaborative filtering model, i.e., a DCF (Deep Collaborative Filtering) model. Further, stochastic gradient descent (SGD) can be used to optimize the DCF model.
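As a simplified stand-in for the DCF model (the patent's actual model is a deep collaborative filtering network; plain matrix factorization trained by SGD is used here only to illustrate the idea of predicting the missing entries of MJ):

```python
import random

def factorize(matrix, rank=2, lr=0.05, epochs=2000, seed=0):
    """Minimal matrix-factorization sketch (a plain, non-deep stand-in for
    the DCF model): learn row/column factors by SGD on the observed entries
    of MJ, then predict every entry, including the missing (None) ones."""
    rng = random.Random(seed)
    rows, cols = len(matrix), len(matrix[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(cols)]
    observed = [(i, j, matrix[i][j]) for i in range(rows) for j in range(cols)
                if matrix[i][j] is not None]
    for _ in range(epochs):
        for i, j, x in observed:
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = x - pred
            for k in range(rank):  # SGD step on both factor rows
                u, v = U[i][k], V[j][k]
                U[i][k] += lr * err * v
                V[j][k] += lr * err * u
    return [[sum(U[i][k] * V[j][k] for k in range(rank)) for j in range(cols)]
            for i in range(rows)]
```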
According to the following formula, the similarity between the deep learning task to be processed and each task in each compute node waiting queue of the GPU cluster is calculated:

cos(θ_JiSjk) = (J_i · S_jk) / (‖J_i‖ ‖S_jk‖)

where J_i represents the feature vector of the i-th deep learning task to be processed, S_j represents the feature vectors of the tasks in the j-th compute node waiting queue of the GPU cluster, S_j contains n tasks, S_jk represents the feature vector of the k-th task in the j-th compute node waiting queue, k ∈ {1, …, n}, θ_JiSjk is the angle between J_i and S_jk, cos(θ_JiSjk) represents the similarity between the i-th deep learning task to be processed and the k-th task in the j-th compute node waiting queue of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk.
According to the similarities between the deep learning task to be processed and each task in each compute node waiting queue of the GPU cluster, the similarity between the deep learning task to be processed and each compute node of the GPU cluster is obtained. For example, the similarity between J_i and S_j is the average of the similarities between J_i and each task in S_j: since S_j contains n tasks, the similarities between the i-th deep learning task to be processed J_i and each of the n tasks in the j-th compute node waiting queue are calculated separately, and their average is taken as the similarity between the i-th deep learning task to be processed and the tasks in the j-th compute node waiting queue of the GPU cluster.
Step 130: according to the similarity, determine the target compute node of the deep learning task to be processed in the GPU cluster.
Specifically, the compute node of the GPU cluster with the smallest similarity is taken as the target compute node of the deep learning task to be processed in the GPU cluster. Selecting the least similar compute node avoids contention for the same computing resources, improving both the execution efficiency of the GPU cluster and the utilization of its computing resources.
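The similarity computation and target-node selection described above can be sketched as follows (illustrative Python only; the function names are assumptions). The cosine similarity matches the definition given earlier, a node's similarity is the mean over its queued tasks, and the least similar node is chosen:

```python
import math

def cosine(a, b):
    """cos of the angle between feature vectors J_i and S_jk; 0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def node_similarity(job, queue):
    """Similarity of job J_i to node j: mean cosine over the n queued tasks S_j."""
    if not queue:
        return 0.0
    return sum(cosine(job, t) for t in queue) / len(queue)

def pick_target_node(job, queues):
    """Target compute node: the node whose waiting queue is least similar."""
    sims = [node_similarity(job, q) for q in queues]
    return min(range(len(queues)), key=lambda j: sims[j])
```

With this sketch, a job whose feature vector is orthogonal to everything queued on a node is routed to that node, matching the contention-avoidance rationale above.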
Step 140: according to the number of GPUs required by the deep learning task to be processed, divide the deep learning task to be processed into multiple target subtasks.
After the deep learning task to be processed has been assigned to its target compute node in the GPU cluster, it is divided into multiple target subtasks according to the number of GPUs it requires. For example, the deep learning task to be processed J_i is divided into the target subtasks J_i = {T_i^1, …, T_i^n}, where n represents the number of subtasks of J_i. Each target subtask is T_i^j, j ∈ {1, …, n}, T_i^j ∈ J_i, where n indicates that the deep learning task to be processed is divided into n target subtasks and T_i^j denotes the j-th target subtask.
Step 150: analyze the task information of each target subtask and the task information of the subtasks being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask.
The set of GPUs in the target compute node is defined as G_k, k ∈ {1, …, m}, where m indicates that there are m GPUs in total in the target compute node and G_k is the k-th GPU.
A target subtask is defined as
T_i^j = (tESM, tUL1, tThPL1, tUL2, tThPL2, tUTex, tThPTex, tUDRAM, tThPDRAM, tThPL, tThPS, batch_size, sub_dsize),
where tESM, tUL1, tThPL1, tUL2, tThPL2, tUTex, tThPTex, tUDRAM, tThPDRAM, tThPL and tThPS respectively represent the SM efficiency, GPU L1 cache utilization, L1 read throughput, GPU L2 cache utilization, L2 read throughput, GPU texture cache utilization, texture cache read throughput, GPU memory utilization, memory read throughput, global load throughput and global store throughput of the target subtask; batch_size is its batch size and sub_dsize indicates the size of the target subtask's data set.
In one possible embodiment, analyzing the task information of the target subtasks and the task information of the subtasks being executed by each GPU in the target compute node to obtain the interference level and communication cost of each target subtask comprises:
mapping the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between target subtasks and GPUs;
inputting the target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance slowdown when the target subtask executes together with each subtask being executed by each GPU;
calculating the interference level of the target subtask from these performance slowdowns;
and calculating the communication cost of the target subtask from the available bandwidth between GPUs and the data volume required to update the model.
The performance slowdown is defined as the ratio between the completion time of a subtask when it runs together with other subtasks and its completion time when it runs in isolation.
The preset performance prediction model into which the target subtask and the subtasks being executed by each GPU are input may be an IAPP (Interference-Aware Performance Prediction) model; specifically, the preset performance prediction model may be a DNN (Deep Neural Network) model.
For example, the target subtasks are defined as J_i = {T_i^1, …, T_i^n}, j ∈ {1, …, n}, T_i^j ∈ J_i, where n indicates that the deep learning task to be processed is divided into n target subtasks and T_i^j denotes the j-th target subtask. The set of GPUs in the target compute node is defined as G = {G_1, …, G_m}, where m indicates that there are m GPUs in the target compute node. The multiple target subtasks J_i = {T_i^1, …, T_i^n} are respectively mapped to the GPUs in the target compute node, giving the mapping relation between target subtasks and GPUs, defined as M(j) = k, j ∈ {1, …, n}, k ∈ {1, …, m}: M(j) = k means that the j-th target subtask T_i^j is assigned to the k-th GPU, i.e. G_k.
Specifically, a subtask-metric matrix M_t is defined to characterize the subtasks, where a subtask is either a target subtask or a subtask being executed by a GPU. For each subtask being executed by a GPU, its vector in M_t is input into the IAPP model, whose output is a vector z_t in which each element represents the performance slowdown of the target subtask when executed together with the other subtasks.
The average performance slowdown when target subtask T_i^j is assigned to a GPU is calculated with the following formula:

Slowdown_jk = (1 / num_k) Σ_{t=1..num_k} slowdown(T_i^j, t)

where Slowdown_jk indicates the average performance slowdown of target subtask T_i^j when assigned to the k-th GPU in the target compute node, num_k indicates the number of subtasks executed jointly on the k-th GPU in the target compute node, and slowdown(T_i^j, t) is the predicted slowdown of T_i^j against the t-th of those subtasks.
The interference level is defined as I(M) = (1/n) Σ_{j=1..n} Slowdown_{j,M(j)}, where I(M(j)) = Slowdown_{j,M(j)} represents the interference level of the j-th target subtask T_i^j under the mapping relation M(j), and n indicates that the deep learning task to be processed is divided into n target subtasks.
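As an illustrative sketch (not the patent's own code), the interference level I(M) can be computed from the pairwise slowdowns predicted by the performance model. The helper names `avg_slowdown`, `interference` and the `slowdown_of` callback are assumptions for this sketch:

```python
def avg_slowdown(slowdowns_on_gpu):
    """Slowdown_jk: mean predicted slowdown of subtask T_i^j against the
    num_k subtasks already running on GPU k (1.0 = no degradation)."""
    if not slowdowns_on_gpu:
        return 1.0  # empty GPU: isolated run, no slowdown
    return sum(slowdowns_on_gpu) / len(slowdowns_on_gpu)

def interference(mapping, slowdown_of):
    """I(M): mean average-slowdown over the n mapped target subtasks.
    mapping[j] = GPU index k for subtask j;
    slowdown_of(j, k) -> list of predicted pairwise slowdowns on GPU k."""
    n = len(mapping)
    return sum(avg_slowdown(slowdown_of(j, k))
               for j, k in enumerate(mapping)) / n
```

In practice `slowdown_of` would query the IAPP/DNN prediction model; here it is just a lookup supplied by the caller.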
The communication cost from G_i to G_j is defined as cc_ij, i ∈ {1, …, m}, j ∈ {1, …, m}, where G_i indicates the i-th GPU in the target compute node and G_j indicates the j-th GPU in the target compute node. The communication cost can be calculated from the available bandwidth between the physical GPUs: high available bandwidth means a low communication cost, and low available bandwidth means a high communication cost.
The communication requirement between target subtask T_i^i and target subtask T_i^j is defined as cr_ij, equal to the data volume required to update the model. The communication cost between target subtasks is then obtained according to the following formula:

C(M) = Σ_{i,j} cr_ij · cc_{M(i)M(j)}

where C(M) represents the communication cost when the mapping relation is M.
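A minimal sketch of the communication cost C(M) follows, under two assumptions that are this sketch's, not fixed by the patent: the per-unit cost cc_ij is taken as the inverse of the available bandwidth (high bandwidth, low cost, as stated above), and subtasks mapped to the same GPU communicate for free:

```python
def comm_cost(mapping, cr, bandwidth):
    """C(M): total communication cost of a mapping M.
    cr[a][b]        = data volume exchanged between subtasks a and b
                      (the model-update communication requirement);
    bandwidth[i][j] = available bandwidth between GPU i and GPU j.
    Per-unit cost cc_ij is assumed to be 1 / bandwidth[i][j]."""
    n = len(mapping)
    total = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            i, j = mapping[a], mapping[b]
            if i != j:  # co-located subtasks assumed free
                total += cr[a][b] / bandwidth[i][j]
    return total
```

Placing heavily communicating subtasks on the same GPU or on a high-bandwidth link drives this term down, which is the trade-off the objective function balances against interference.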
Step 160: according to the interference level and communication cost of each target subtask, determine the execution GPU of each target subtask among the GPUs of the target compute node.
By taking both the interference level and the communication cost of the target subtasks into account, unbalanced resource allocation across the GPUs of a compute node is avoided, high parallelism of the deep learning task is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of the deep learning task is improved at the same time.
In one possible embodiment, determining the execution GPU of each target subtask among the GPUs of the target compute node according to the interference level and communication cost of each target subtask comprises:
determining the objective function of the target GPU, where the objective function is

F(M) = α · I(M) + β · C(M)

in which F(M) is the objective function value, I(M) represents the interference level when the mapping relation is M, C(M) represents the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
By taking both the interference level and the communication cost of the target subtasks into account, unbalanced resource allocation across the GPUs of a compute node is avoided, high parallelism of the deep learning task is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of the deep learning task is improved at the same time.
In one possible embodiment, determining the execution GPU of each target subtask among the GPUs of the target compute node according to the interference level and communication cost of each target subtask comprises:
determining the target GPUs of the target subtasks in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
Taking particle swarm optimization as an example, a group of particles is initialized, each particle representing an assignment of the target subtasks to be distributed. The particle positions and velocities are randomly initialized over the GPU set G = {G_1, …, G_m}; the value assigned to each dimension of a particle is a GPU index (for example, k denotes the k-th GPU), so that a particle represents a mapping of multiple subtasks to GPUs.
Each particle is described by its current position x_i and current velocity v_i, and each particle knows both the best position pbest it has found so far and the global best position gbest found so far by the entire swarm. The principle of the algorithm is to move these particles so as to find the optimal solution: each particle's position is influenced by its own best position pbest and by the global best position gbest, and each particle updates its pbest using the fitness value computed by the fitness function in every generation. In each iteration, every particle updates its velocity and position as

v_i = ω·v_i + c_1·r_1·(pbest_i − x_i) + c_2·r_2·(gbest − x_i)
x_i = x_i + v_i

where ω is the inertia weight of the particle, c_1 and c_2 are acceleration coefficients, and r_1 and r_2 are random numbers between 0 and 1.
The objective function is used as the fitness function to evaluate each particle, after which the velocity and position of each particle are updated according to the two update formulas. The iterations are executed until the specified number of iterations is reached or the iteration precision is met, and the optimal GPUs are found.
Determining the target GPUs of the target subtasks in the target compute node according to the preset optimization algorithm, the interference level and the communication cost maximizes the improvement in the resource utilization of the GPU cluster and, at the same time, maximizes the improvement in the execution efficiency of the deep learning task.
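The particle swarm procedure described above can be sketched as follows. This is an illustrative discrete PSO in which continuous positions are rounded to GPU indices; the rounding scheme, hyperparameter values and the name `pso_assign` are assumptions of this sketch, and the `fitness` argument stands in for the objective that weights interference against communication cost:

```python
import random

def pso_assign(n_tasks, n_gpus, fitness, particles=30, iters=100, seed=0,
               w=0.6, c1=1.5, c2=1.5):
    """Discrete PSO sketch: each particle is a candidate mapping of the n
    target subtasks onto GPUs; fitness(mapping) -> value to minimize."""
    rnd = random.Random(seed)
    pos = [[rnd.uniform(0, n_gpus - 1) for _ in range(n_tasks)]
           for _ in range(particles)]
    vel = [[0.0] * n_tasks for _ in range(particles)]

    def decode(x):  # round/clamp continuous position to GPU indices
        return [min(n_gpus - 1, max(0, round(v))) for v in x]

    pbest = [p[:] for p in pos]
    pbest_f = [fitness(decode(p)) for p in pos]
    g = min(range(particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iters):
        for i in range(particles):
            for d in range(n_tasks):
                r1, r2 = rnd.random(), rnd.random()
                # velocity update: inertia + pull toward pbest and gbest
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(decode(pos[i]))
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return decode(gbest), gbest_f
```

In the setting above, `fitness` would evaluate α·I(M) + β·C(M) for the decoded mapping; any of the other listed metaheuristics could replace this search loop.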
By first analyzing the similarity between the deep learning task to be processed and each compute node of the GPU cluster, the target compute node of the deep learning task to be processed in the GPU cluster is determined; fully considering the similarity between the deep learning task to be processed and the other tasks reduces the possibility of resource contention on the compute nodes, thereby improving the system resource utilization and the execution efficiency of the deep learning tasks. Then, according to the number of GPUs the deep learning task to be processed requires, it is divided into multiple target subtasks, and the interference level and communication cost of the target subtasks are analyzed so as to determine their target GPUs in the target compute node. By considering the interference level and communication cost of the target subtasks, unbalanced resource allocation across the GPUs of a compute node is avoided, high parallelism of the deep learning task is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of the deep learning task is improved at the same time.
An embodiment of the present application also provides a device. Referring to Fig. 2, Fig. 2 is a schematic diagram of a GPU cluster deep learning task parallelization device of an embodiment of the present application. The device comprises:
an acquisition module 210, configured to obtain a deep learning task to be processed and the task-set information of a GPU cluster, the task-set information of the GPU cluster including the task information of each task in each compute node waiting queue of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
a first analysis module 220, configured to analyze the task information of the deep learning task to be processed and the task information of each task in each compute node waiting queue of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
a compute node determination module 230, configured to determine, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
a subtask module 240, configured to divide the deep learning task to be processed into multiple target subtasks according to the number of GPUs the deep learning task to be processed requires;
a second analysis module 250, configured to analyze the task information of the target subtasks and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
a GPU determination module 260, configured to determine, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
In one possible embodiment, the task information of the deep learning task to be processed comprises at least one of: the maximum CPU usage of the deep learning task to be processed, its host memory utilization, its I/O throughput, its GPU utilization, its device memory utilization, its bandwidth utilization, the number of samples it analyzes per step, and the size of its data set.
In one possible embodiment, the first analysis module 220 is specifically configured to:
input the task information of the deep learning task to be processed and the task information of each task in each compute node waiting queue of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vectors of the tasks in each compute node waiting queue of the GPU cluster;
calculate the similarity between the deep learning task to be processed and each task in each compute node waiting queue of the GPU cluster according to the following formula:

cos(θ_JiSjk) = (J_i · S_jk) / (‖J_i‖ ‖S_jk‖)

where J_i represents the feature vector of the i-th deep learning task to be processed, S_j represents the feature vectors of the tasks in the j-th compute node waiting queue of the GPU cluster, S_j contains n tasks, S_jk represents the feature vector of the k-th task in the j-th compute node waiting queue, k ∈ {1, …, n}, θ_JiSjk is the angle between J_i and S_jk, cos(θ_JiSjk) represents the similarity between the i-th deep learning task to be processed and the k-th task in the j-th compute node waiting queue of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk;
and, according to the similarities between the deep learning task to be processed and each task in each compute node waiting queue of the GPU cluster, obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster.
In one possible embodiment, the second analysis module 250 is specifically configured to:
map the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between target subtasks and GPUs;
input the target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance slowdown when the target subtask executes together with each subtask being executed by each GPU;
calculate the interference level of the target subtask from these performance slowdowns;
and calculate the communication cost of the target subtask from the available bandwidth between GPUs and the data volume required to update the model.
In one possible embodiment, the second analysis module 250 is specifically configured to:
determine the objective function of the target GPU, where the objective function is

F(M) = α · I(M) + β · C(M)

in which F(M) is the objective function value, I(M) represents the interference level when the mapping relation is M, C(M) represents the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
In one possible embodiment, the second analysis module 250 is specifically configured to:
determine the target GPUs of the target subtasks in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
An embodiment of the present application also provides an electronic device. Referring to Fig. 3, the electronic device comprises a processor 310, a communication interface 320, a memory 330 and a communication bus 340, where the processor 310, the communication interface 320 and the memory 330 communicate with each other through the communication bus 340;
the memory 330 is configured to store a computer program;
the processor 310, when executing the computer program stored in the memory 330, implements the following steps:
obtaining a deep learning task to be processed and the task-set information of a GPU cluster, the task-set information of the GPU cluster including the task information of each task in each compute node waiting queue of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
analyzing the task information of the deep learning task to be processed and the task information of each task in each compute node waiting queue of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
determining, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs the deep learning task to be processed requires;
analyzing the task information of the target subtasks and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
and determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
For example, the processor 310 of the electronic device includes a centralized control unit and a GPU cluster formed by multiple GPUs, where the GPU cluster includes multiple compute nodes, each compute node is composed of multiple GPUs, and the centralized control unit includes a data collector, a cluster management unit and a node management unit; the electronic device is used to process multiple deep learning tasks in parallel on the GPU cluster. The data collector obtains the deep learning task to be processed and the task-set information of the GPU cluster; the cluster management unit analyzes the task information of the deep learning task to be processed and the task information of each task in each compute node waiting queue of the GPU cluster, obtains the similarity between the deep learning task to be processed and each compute node of the GPU cluster, and determines, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster; the node management unit divides the deep learning task to be processed into multiple target subtasks according to the number of GPUs it requires, then analyzes the task information of the target subtasks and the task information of each subtask being executed by each GPU in the target compute node to obtain the interference level and communication cost of each target subtask, and finally determines, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
Optionally, when executing the program stored in the memory 330, the processor 310 can also implement any of the above GPU cluster deep learning task parallelization methods.
The communication bus mentioned for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a GPU (Graphics Processing Unit), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present application also provides a computer-readable storage medium in which instructions are stored; when the instructions run on a computer, they cause the computer to execute any of the GPU cluster deep learning task parallelization methods of the above embodiments.
It should be noted that, in this document, the technical features of the optional solutions may be combined to form further solutions as long as they are not contradictory, and these solutions all fall within the scope disclosed by the present application. Relational terms such as first and second are used merely to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes it.
Each embodiment in this specification is described in a progressive manner; identical or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the embodiments of the device, the electronic device and the storage medium are substantially similar to the method embodiments, their description is relatively simple, and the relevant parts may refer to the description of the method embodiments.
The above are only preferred embodiments of the present application and are not intended to limit the protection scope of the present application. Any modification, equivalent replacement or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A GPU cluster deep learning task parallelization method, characterized by comprising:
obtaining a deep learning task to be processed and the task-set information of a GPU cluster, the task-set information of the GPU cluster including the task information of each task in each compute node waiting queue of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
analyzing the task information of the deep learning task to be processed and the task information of each task in each compute node waiting queue of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
determining, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs the deep learning task to be processed requires;
analyzing the task information of the target subtasks and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
and determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
2. The method according to claim 1, characterized in that the task information of the deep learning task to be processed comprises: the maximum CPU usage of the deep learning task to be processed, its host memory utilization, its I/O throughput, its GPU utilization, its device memory utilization, its bandwidth utilization, the number of samples it analyzes per step, and the size of its data set.
3. The method according to claim 1, wherein analyzing the task information of the to-be-processed deep learning task and the task information of each task in the waiting queue of each compute node of the GPU cluster, to respectively obtain the similarity between the to-be-processed deep learning task and each compute node of the GPU cluster, comprises:
inputting the task information of the to-be-processed deep learning task and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to respectively obtain the feature vector of the to-be-processed deep learning task and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculating the similarity between the to-be-processed deep learning task and each task in the waiting queue of each compute node of the GPU cluster according to the following formula:
cos(θ_JiSjk) = (Ji · Sjk) / (‖Ji‖ ‖Sjk‖)
where Ji is the feature vector of the i-th to-be-processed deep learning task; Sj is the set of feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster and contains n tasks; Sjk is the feature vector of the k-th task in the waiting queue of the j-th compute node, k ∈ {1, …, n}; θ_JiSjk is the angle between Ji and Sjk; cos(θ_JiSjk) is the similarity between the i-th to-be-processed deep learning task and the k-th task in the waiting queue of the j-th compute node; ‖Ji‖ is the norm of the vector Ji and ‖Sjk‖ is the norm of the vector Sjk;
obtaining the similarity between the to-be-processed deep learning task and each compute node of the GPU cluster from the similarities between the to-be-processed deep learning task and the tasks in the waiting queue of each compute node.
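Outside the claim language, the per-task cosine similarity above can be sketched in Python as follows. The claim leaves open how per-task similarities are aggregated into a per-node similarity, so the mean used in `node_similarity` is an assumption:

```python
import math

def cosine_similarity(ji, sjk):
    """cos(theta_JiSjk) = (Ji . Sjk) / (||Ji|| * ||Sjk||)."""
    dot = sum(a * b for a, b in zip(ji, sjk))
    norm_ji = math.sqrt(sum(a * a for a in ji))
    norm_sjk = math.sqrt(sum(b * b for b in sjk))
    return dot / (norm_ji * norm_sjk)

def node_similarity(ji, queue_vectors):
    """Similarity between a task vector Ji and one compute node's waiting
    queue. The aggregation rule (a mean) is assumed, not the patent's."""
    sims = [cosine_similarity(ji, sjk) for sjk in queue_vectors]
    return sum(sims) / len(sims)
```

The scheduler would then pick the compute node whose waiting queue scores, e.g., lowest similarity, so that dissimilar tasks share a node and contend less for the same resources.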
4. The method according to claim 1, wherein analyzing the task information of the target subtasks and the task information of each subtask being executed by each GPU in the target compute node, to respectively obtain the interference level and the communication cost of each target subtask, comprises:
mapping the multiple target subtasks to the GPUs in the target compute node, to obtain mapping relations between the target subtasks and the GPUs;
inputting each target subtask and the subtasks being executed by each GPU into a preset performance prediction model, to respectively obtain the performance degradation when the target subtask is executed together with the subtasks being executed by each GPU;
calculating the interference level of each target subtask from the respective performance degradations;
calculating the communication cost of each target subtask from the available bandwidth between the GPUs and the amount of data required for updating the model.
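The last two steps of claim 4 can be sketched as below. The patent does not fix either formula here, so the mean-degradation interference level and the data-volume-over-bandwidth communication cost are assumptions for illustration only:

```python
def interference_level(degradations):
    """Interference level of a target subtask, combined from the predicted
    performance degradations against each co-running subtask.
    The combination rule (a mean) is an assumption, not the patent's."""
    return sum(degradations) / len(degradations)

def communication_cost(update_bytes, bandwidth_bytes_per_s):
    """Communication cost taken as the time to transfer one model update
    over the available inter-GPU bandwidth (an assumed definition)."""
    return update_bytes / bandwidth_bytes_per_s
```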
5. The method according to claim 1, wherein determining, according to the interference level and the communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node comprises:
determining the target GPU by an objective function, wherein the objective function is:
F(M) = α · I(M) + β · C(M)
where F(M) is the objective value, I(M) is the interference level under the mapping relation M, C(M) is the communication cost under the mapping relation M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
the GPU corresponding to the mapping relation M for which F(M) is minimal is the target GPU.
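The weighted objective of claim 5 is straightforward to evaluate over a set of candidate mappings; the dictionary-based enumeration below is only an illustrative harness, and the default weights are arbitrary:

```python
def objective(i_m, c_m, alpha=0.5, beta=0.5):
    """F(M) = alpha * I(M) + beta * C(M), with alpha + beta = 1."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * i_m + beta * c_m

def best_mapping(candidates, alpha=0.5, beta=0.5):
    """Return the mapping id minimizing F(M).
    candidates: dict mapping a mapping id to its (I(M), C(M)) pair."""
    return min(candidates, key=lambda m: objective(*candidates[m], alpha, beta))
```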
6. The method according to any one of claims 1-5, wherein determining, according to the interference level and the communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node comprises:
determining the target GPU of the target subtask in the target compute node according to a preset optimization algorithm, the interference level, and the communication cost, the preset optimization algorithm being an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm, or a particle swarm optimization algorithm.
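As one example of the preset optimization algorithms named in claim 6, a minimal simulated-annealing search over subtask-to-GPU mappings might look as follows; all parameter values and helper names here are illustrative assumptions, not the patent's implementation:

```python
import math
import random

def anneal(initial, neighbor, cost, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Minimal simulated annealing over subtask->GPU mappings.
    `neighbor(mapping, rng)` proposes a perturbed mapping, and `cost`
    could be the weighted objective F(M) of claim 5."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        # Always accept improvements; accept regressions with Boltzmann probability.
        if (candidate_cost < current_cost
                or rng.random() < math.exp((current_cost - candidate_cost) / t)):
            current, current_cost = candidate, candidate_cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost
```

A toy run encodes each mapping as a tuple of GPU indices and perturbs one subtask's assignment per step; any of the other algorithms listed in the claim could substitute for the annealing loop.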
7. A GPU cluster deep learning task parallelization apparatus, comprising:
an acquisition module, configured to obtain task information of a to-be-processed deep learning task and task-set information of a GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
a first analysis module, configured to analyze the task information of the to-be-processed deep learning task and the task information of each task in the waiting queue of each compute node of the GPU cluster, to respectively obtain the similarity between the to-be-processed deep learning task and each compute node of the GPU cluster;
a compute node determination module, configured to determine, according to the similarity, the target compute node of the to-be-processed deep learning task in the GPU cluster;
a subtask module, configured to divide the to-be-processed deep learning task into multiple target subtasks according to the number of GPUs required by the to-be-processed deep learning task;
a second analysis module, configured to analyze the task information of the target subtasks and the task information of each subtask being executed by each GPU in the target compute node, to respectively obtain the interference level and the communication cost of each target subtask;
a GPU determination module, configured to determine, according to the interference level and the communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
8. The apparatus according to claim 7, wherein the task information of the to-be-processed deep learning task includes at least one of: the maximum CPU usage of the to-be-processed deep learning task, the host memory usage of the to-be-processed deep learning task, the I/O throughput of the to-be-processed deep learning task, the GPU utilization of the to-be-processed deep learning task, the device memory usage of the to-be-processed deep learning task, the bandwidth utilization of the to-be-processed deep learning task, the number of samples processed in each step of the to-be-processed deep learning task, and the size of the dataset of the to-be-processed deep learning task.
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program; and
the processor, when executing the program stored in the memory, implements the GPU cluster deep learning task parallelization method according to any one of claims 1-6.
10. A storage medium, wherein a computer program is stored in the storage medium, and the computer program, when executed by a processor, implements the GPU cluster deep learning task parallelization method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910675587.9A CN110399222B (en) | 2019-07-25 | 2019-07-25 | GPU cluster deep learning task parallelization method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110399222A true CN110399222A (en) | 2019-11-01 |
CN110399222B CN110399222B (en) | 2022-01-21 |
Family
ID=68325235
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910675587.9A Active CN110399222B (en) | 2019-07-25 | 2019-07-25 | GPU cluster deep learning task parallelization method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110399222B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050125369A1 (en) * | 2003-12-09 | 2005-06-09 | Microsoft Corporation | System and method for accelerating and optimizing the processing of machine learning techniques using a graphics processing unit |
US20140201741A1 (en) * | 2010-07-26 | 2014-07-17 | Microsoft Corporation | Workload interference estimation and performance optimization |
US20150301862A1 (en) * | 2012-09-14 | 2015-10-22 | International Business Machines Corporation | Preferential cpu utilization for tasks |
CN105900064A (en) * | 2014-11-19 | 2016-08-24 | 华为技术有限公司 | Method and apparatus for scheduling data flow task |
CN107045456A (en) * | 2016-02-05 | 2017-08-15 | 华为技术有限公司 | A kind of resource allocation methods and explorer |
CN107135257A (en) * | 2017-04-28 | 2017-09-05 | 东方网力科技股份有限公司 | Task is distributed in a kind of node cluster method, node and system |
CN107329828A (en) * | 2017-06-26 | 2017-11-07 | 华中科技大学 | A kind of data flow programmed method and system towards CPU/GPU isomeric groups |
CN107766148A (en) * | 2017-08-31 | 2018-03-06 | 北京百度网讯科技有限公司 | A kind of isomeric group and task processing method and device |
CN109101339A (en) * | 2018-08-15 | 2018-12-28 | 北京邮电大学 | Video task parallel method, device and Heterogeneous Cluster Environment in isomeric group |
CN109936604A (en) * | 2017-12-18 | 2019-06-25 | 北京图森未来科技有限公司 | A kind of resource regulating method, device and system |
Non-Patent Citations (4)
Title |
---|
George Teodoro et al.: "Approximate Similarity Search for Online Multimedia Services on Distributed CPU-GPU Platforms", arXiv |
Haitao Zhang et al.: "Learning Driven Parallelization for Large-Scale Video Workload in Hybrid CPU-GPU Cluster", ICPP 2018 |
Jeon, M. et al.: "Multi-tenant GPU clusters for deep learning workloads", Technical Report, MSR-TR-2018 |
Wei Qiao et al.: "DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment", ITM Web of Conferences |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112965809A (en) * | 2019-12-12 | 2021-06-15 | 深圳市优必选科技股份有限公司 | Deep learning task processing system and method |
CN111104289A (en) * | 2019-12-25 | 2020-05-05 | 创新奇智(上海)科技有限公司 | System and method for checking efficiency of GPU (graphics processing Unit) cluster |
CN111258735A (en) * | 2020-01-16 | 2020-06-09 | 中国人民解放军国防科技大学 | Deep learning task scheduling method supporting QoS (quality of service) perception of user |
CN111309479A (en) * | 2020-02-14 | 2020-06-19 | 北京百度网讯科技有限公司 | Method, device, equipment and medium for realizing task parallel processing |
US11954522B2 (en) | 2020-02-14 | 2024-04-09 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method for processing tasks in parallel, device and storage medium |
CN111866187A (en) * | 2020-06-30 | 2020-10-30 | 中科院计算所西部高等技术研究院 | Task scheduling method of distributed deep learning reasoning cloud platform |
CN111866187B (en) * | 2020-06-30 | 2022-10-04 | 中科院计算所西部高等技术研究院 | Task scheduling method for distributed deep learning reasoning cloud platform |
CN111913799B (en) * | 2020-07-14 | 2024-04-19 | 北京华夏启信科技有限公司 | Video stream online analysis task scheduling method and computer equipment |
CN111913799A (en) * | 2020-07-14 | 2020-11-10 | 北京华夏启信科技有限公司 | Video stream online analysis task scheduling method and computer equipment |
CN112416585B (en) * | 2020-11-20 | 2024-03-15 | 南京大学 | Deep learning-oriented GPU resource management and intelligent scheduling method |
CN112416585A (en) * | 2020-11-20 | 2021-02-26 | 南京大学 | GPU resource management and intelligent scheduling method for deep learning |
CN112584143A (en) * | 2020-12-02 | 2021-03-30 | 浙江大华技术股份有限公司 | Video coding method, device and system and computer readable storage medium |
CN112584143B (en) * | 2020-12-02 | 2022-09-06 | 浙江大华技术股份有限公司 | Video coding method, device and system and computer readable storage medium |
WO2022116142A1 (en) * | 2020-12-04 | 2022-06-09 | 深圳大学 | Resource scheduling method based on graph neural network |
CN113194086B (en) * | 2021-04-27 | 2022-05-27 | 新华三信息安全技术有限公司 | Anti-attack method and device |
CN113194086A (en) * | 2021-04-27 | 2021-07-30 | 新华三信息安全技术有限公司 | Anti-attack method and device |
CN113377520A (en) * | 2021-07-07 | 2021-09-10 | 北京百度网讯科技有限公司 | Resource scheduling method, device, equipment and storage medium |
CN113900793A (en) * | 2021-07-29 | 2022-01-07 | 苏州浪潮智能科技有限公司 | Server cluster and deep learning aggregate communication system and method thereof |
CN113900793B (en) * | 2021-07-29 | 2023-11-10 | 苏州浪潮智能科技有限公司 | Server cluster and deep learning aggregate communication system and method thereof |
CN114285766A (en) * | 2021-08-20 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Network bandwidth detection method and device, electronic equipment and storage medium |
CN114285766B (en) * | 2021-08-20 | 2023-06-13 | 腾讯科技(深圳)有限公司 | Network bandwidth detection method and device, electronic equipment and storage medium |
CN114116220A (en) * | 2021-11-29 | 2022-03-01 | 苏州浪潮智能科技有限公司 | GPU (graphics processing Unit) sharing control method, GPU sharing control device and storage medium |
CN114138449A (en) * | 2021-12-14 | 2022-03-04 | 河南省儿童医院郑州儿童医院 | Rehabilitation training system based on virtual reality |
WO2024022046A1 (en) * | 2022-07-28 | 2024-02-01 | 华为技术有限公司 | Deep learning system and method |
CN115248728A (en) * | 2022-09-21 | 2022-10-28 | 之江实验室 | Distributed training task scheduling method, system and device for intelligent computing |
CN115373861A (en) * | 2022-10-26 | 2022-11-22 | 小米汽车科技有限公司 | GPU resource scheduling method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110399222B (en) | 2022-01-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110399222A (en) | GPU cluster deep learning task parallel method, device and electronic equipment | |
Mapetu et al. | Low-time complexity and low-cost binary particle swarm optimization algorithm for task scheduling and load balancing in cloud computing | |
CN1956457B (en) | Method and apparatus for arranging mesh work in mesh computing system | |
CN110389820B (en) | Private cloud task scheduling method for resource prediction based on v-TGRU model | |
US9239734B2 (en) | Scheduling method and system, computing grid, and corresponding computer-program product | |
CN105373432B (en) | A kind of cloud computing resource scheduling method based on virtual resource status predication | |
Abdel‐Basset et al. | IEGA: an improved elitism‐based genetic algorithm for task scheduling problem in fog computing | |
You et al. | Comprehensive workload analysis and modeling of a petascale supercomputer | |
Chen et al. | Scheduling independent tasks in cloud environment based on modified differential evolution | |
Li et al. | An effective scheduling strategy based on hypergraph partition in geographically distributed datacenters | |
Rani et al. | An efficient and scalable hybrid task scheduling approach for cloud environment | |
CN108427602B (en) | Distributed computing task cooperative scheduling method and device | |
Ding et al. | Kubernetes-oriented microservice placement with dynamic resource allocation | |
CN115168027A (en) | Calculation power resource measurement method based on deep reinforcement learning | |
CN113553160A (en) | Task scheduling method and system for edge computing node of artificial intelligence Internet of things | |
Hu et al. | Improved heuristic job scheduling method to enhance throughput for big data analytics | |
Xilin et al. | Resource allocation optimization of equipment development task based on MOPSO algorithm | |
CN112000460A (en) | Service capacity expansion method based on improved Bayesian algorithm and related equipment | |
CN117349026B (en) | Distributed computing power scheduling system for AIGC model training | |
Ghafari et al. | E-AVOA-TS: Enhanced African vultures optimization algorithm-based task scheduling strategy for fog–cloud computing | |
Li et al. | Dynamic data replacement and adaptive scheduling policies in spark | |
Zhou et al. | Stability property of clouds and cooperative scheduling policies on multiple types of resources in cloud computing | |
CN116225708A (en) | GPU resource scheduling method and device | |
Yassir et al. | Graph-based model and algorithm for minimising big data movement in a cloud environment | |
Yu | [Retracted] Research on Optimization Strategy of Task Scheduling Software Based on Genetic Algorithm in Cloud Computing Environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |