CN110399222A - GPU cluster deep learning task parallel method, device and electronic equipment - Google Patents

GPU cluster deep learning task parallel method, device and electronic equipment

Info

Publication number
CN110399222A
Authority
CN
China
Prior art keywords
gpu
deep learning
processed
task
learning task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910675587.9A
Other languages
Chinese (zh)
Other versions
CN110399222B (en)
Inventor
张海涛
耿欣
马华东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201910675587.9A priority Critical patent/CN110399222B/en
Publication of CN110399222A publication Critical patent/CN110399222A/en
Application granted granted Critical
Publication of CN110399222B publication Critical patent/CN110399222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The GPU cluster deep learning task parallelization method, apparatus and electronic device provided by the embodiments of the present application relate to the field of Internet technology. The similarity between a deep learning task to be processed and each compute node of a GPU cluster is first analyzed to determine the target compute node of the task in the GPU cluster, which reduces the likelihood of resource contention on the compute nodes and thereby improves system resource utilization and the execution efficiency of deep learning tasks. The deep learning task to be processed is then divided into multiple target subtasks according to the number of GPUs it requires, and the interference level and communication cost of each target subtask are analyzed to determine the target GPU of each subtask within the target compute node. This avoids unbalanced resource allocation across the GPUs of a compute node, achieves a high degree of parallelism for deep learning tasks, improves the resource utilization of the GPU cluster, and at the same time improves the execution efficiency of deep learning tasks.

Description

GPU cluster deep learning task parallel method, device and electronic equipment
Technical field
The present application relates to the field of Internet technology, and in particular to a GPU cluster deep learning task parallelization method, apparatus and electronic device.
Background technique
With the continuous deepening of deep learning research, deep learning technology has achieved great success in fields such as computer vision, speech recognition and text processing, bringing great convenience to people's lives. However, complex neural network models and massive amounts of data place higher demands on computing power. A GPU (Graphics Processing Unit) cluster integrates multiple GPU computing resources and provides powerful and efficient parallel computing capability for computation-intensive deep learning tasks, effectively meeting the computing demands of multiple deep learning tasks.
However, when deep learning tasks run on a resource-shared GPU cloud platform, their execution efficiency is affected by the interference caused by resource contention among co-executing tasks. Therefore, for deep learning tasks in a GPU cluster, how to schedule tasks in parallel according to task information and their resource demands, and how to make reasonable use of the nodes of the GPU cluster and the multiple GPU resources on each node, are essential for optimizing the execution time of deep learning tasks, improving the processing performance of the overall computation and raising the resource utilization of the system.
At present, mainstream GPU clusters mainly use traditional schedulers (e.g. Kubernetes, Yarn) to schedule deep learning tasks. By accounting for the overall resource usage, resources are allocated to GPUs in a reasonable manner, and sufficient resources are ensured within the lifetime of a GPU to guarantee its operation.
Although this approach achieves the parallelization of deep learning tasks to a certain extent, it mainly considers resource usage and does not take into account the physical characteristics of the resources or the characteristics of the tasks themselves. It therefore cannot achieve efficient parallelization of deep learning tasks and may reduce the execution efficiency of deep learning workloads. Moreover, it does not support fine-grained multi-task allocation on a GPU and cannot make full use of the GPU resources on a node, which affects the efficient execution of deep learning tasks and lowers the GPU utilization of the node, thereby affecting the resource utilization of the GPU cluster.
Summary of the invention
The purpose of the embodiments of the present application is to provide a GPU cluster deep learning task parallelization method, apparatus, electronic device, storage medium and computer program product containing instructions, so as to achieve a high degree of parallelism for deep learning tasks and to improve the utilization of GPU resources in the GPU cluster and the execution efficiency of deep learning tasks.
The specific technical solutions are as follows:
In a first aspect, an embodiment of the present application provides a GPU cluster deep learning task parallelization method, comprising:
obtaining a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
determining, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
Optionally, the task information of the deep learning task to be processed comprises: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
Optionally, analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster, comprises:
inputting the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculating, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk;
obtaining the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
Optionally, analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask, comprises:
mapping the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between the target subtasks and the GPUs;
inputting each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation of the target subtask when executed jointly with each subtask being executed by each GPU;
calculating the interference level of each target subtask according to the performance degradations;
calculating the communication cost of each target subtask according to the available bandwidth between the GPUs and the amount of data required for updating the model.
Optionally, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node, comprises:
determining the objective function of the target GPU, wherein the objective function is:

F(M) = \alpha \, I(M) + \beta \, C(M)

where F(M) is the objective function value, I(M) denotes the interference level when the mapping relation is M, C(M) denotes the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
Optionally, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node, comprises:
determining the target GPU of each target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
In a second aspect, an embodiment of the present application provides a GPU cluster deep learning task parallelization apparatus, comprising:
an acquisition module, configured to obtain a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
a first analysis module, configured to analyze the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
a compute node determination module, configured to determine, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
a subtask module, configured to divide the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
a second analysis module, configured to analyze the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
a GPU determination module, configured to determine, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
Optionally, the task information of the deep learning task to be processed comprises at least one of: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
Optionally, the first analysis module is specifically configured to:
input the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculate, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk;
obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
Optionally, the second analysis module is specifically configured to:
map the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between the target subtasks and the GPUs;
input each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation of the target subtask when executed jointly with each subtask being executed by each GPU;
calculate the interference level of each target subtask according to the performance degradations;
calculate the communication cost of each target subtask according to the available bandwidth between the GPUs and the amount of data required for updating the model.
Optionally, the second analysis module is specifically configured to:
determine the objective function of the target GPU, wherein the objective function is:

F(M) = \alpha \, I(M) + \beta \, C(M)

where F(M) is the objective function value, I(M) denotes the interference level when the mapping relation is M, C(M) denotes the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
Optionally, the second analysis module is specifically configured to:
determine the target GPU of each target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
In a third aspect, an embodiment of the present application provides an electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein:
the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement, when executing the program stored in the memory, the GPU cluster deep learning task parallelization method of any one of the above first aspect.
In a fourth aspect, an embodiment of the present application provides a storage medium having instructions stored therein which, when run on a computer, cause the computer to execute the GPU cluster deep learning task parallelization method of any one of the above first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product containing instructions which, when run on a computer, causes the computer to execute the GPU cluster deep learning task parallelization method of any one of the above first aspect.
With the GPU cluster deep learning task parallelization method, apparatus, electronic device, storage medium and computer program product containing instructions provided by the embodiments of the present application, the similarity between a deep learning task to be processed and each compute node of the GPU cluster is first analyzed to determine the target compute node of the task in the GPU cluster. By fully considering the similarity between the deep learning task to be processed and the other tasks, the likelihood of resource contention on the compute nodes is reduced, thereby improving system resource utilization and the execution efficiency of deep learning tasks. The deep learning task to be processed is then divided into multiple target subtasks according to the number of GPUs it requires, and the interference level and communication cost of each target subtask are analyzed to determine the target GPU of each subtask within the target compute node. By considering the interference level and communication cost of the target subtasks, unbalanced resource allocation across the GPUs of a compute node is avoided, a high degree of parallelism for deep learning tasks is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of deep learning tasks is improved at the same time. Of course, implementing any product or method of the present application does not necessarily require achieving all of the above advantages at the same time.
Detailed description of the invention
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of a GPU cluster deep learning task parallelization method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a GPU cluster deep learning task parallelization apparatus according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an electronic device according to an embodiment of the present application.
Specific embodiment
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The embodiments of the present application disclose a GPU cluster deep learning task parallelization method, apparatus, electronic device, storage medium and computer program product containing instructions, which are described separately below.
An embodiment of the present application provides a GPU cluster deep learning task parallelization method. Referring to Fig. 1, which is a schematic diagram of the GPU cluster deep learning task parallelization method according to an embodiment of the present application, the method includes the following steps:
Step 110, obtaining a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster.
The GPU cluster deep learning task parallelization method of the embodiment of the present application may be implemented by an electronic device; specifically, the electronic device may be a server.
In order to improve GPU computing performance, GPUs can be scaled out into a GPU cluster, i.e. a GPU cluster composed of multiple GPUs on multiple nodes; such a GPU cluster integrates multiple GPUs and can complete complex computing tasks. Therefore, when large-scale deep learning training is performed in the GPU cluster, there may be multiple deep learning tasks in the GPU cluster, and the deep learning task to be processed is any one of them. Assuming that the system has p deep learning tasks to be processed, a deep learning task to be processed is defined as J_i, i ∈ {1, …, p}. The GPU cluster contains a task set, where the task set includes each task in the waiting queue of each compute node of the GPU cluster and each subtask being executed by each GPU in each compute node of the GPU cluster.
Step 120, analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster.
In order to improve the execution efficiency of the GPU cluster and the utilization of computing resources, the similarity between the deep learning task to be processed and each compute node of the GPU cluster needs to be computed, so as to avoid interference between computing resources.
In a possible embodiment, the task information of the deep learning task to be processed comprises: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
In the embodiment of the present application, the deep learning task to be processed is characterized as J_i = (jU_CPU, jU_hMem, jThP_I/O, jU_GPU, jU_dMem, jThP_PCIe, batch_size, dsize), where jU_CPU denotes the maximum CPU usage of the deep learning task to be processed, jU_hMem denotes the host memory utilization of the deep learning task to be processed, jThP_I/O denotes the I/O throughput of the deep learning task to be processed, jU_GPU denotes the GPU utilization of the deep learning task to be processed, jU_dMem denotes the device memory utilization of the deep learning task to be processed, jThP_PCIe denotes the bandwidth utilization of the deep learning task to be processed, batch_size denotes the number of samples analyzed in each training step of the deep learning task to be processed, and dsize denotes the size of the data set of the deep learning task to be processed.
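As a purely illustrative sketch (the field names are assumptions chosen to mirror the symbols above, not part of the original disclosure), the task characterization can be held in a small data structure and turned into the numeric vector J_i used later for the similarity computation:

```python
from dataclasses import dataclass, astuple
import numpy as np

@dataclass
class TaskCharacterization:
    """Feature vector J_i of a deep learning task to be processed (assumed field names)."""
    cpu_usage: float            # jU_CPU, maximum CPU usage
    host_mem_util: float        # jU_hMem, host memory utilization
    io_throughput: float        # jThP_I/O
    gpu_util: float             # jU_GPU
    device_mem_util: float      # jU_dMem
    pcie_bandwidth_util: float  # jThP_PCIe
    batch_size: float           # samples analyzed per training step
    dsize: float                # size of the training data set

    def to_vector(self) -> np.ndarray:
        """Return J_i as a numeric vector."""
        return np.asarray(astuple(self), dtype=float)
```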
In a possible embodiment, analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster, comprises:
inputting the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculating, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk;
obtaining the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
The preset similarity prediction model may be an IASP (Interference-Aware Similarity Prediction) model.
For example, a deep-learning-task-metric matrix M_J is defined to characterize the deep learning tasks to be processed, where each row of the matrix M_J represents one deep learning task to be processed and each column represents one performance feature of the deep learning task to be processed, i.e. the matrix is composed of CPU resource utilization and GPU resource utilization metrics.
The task information of the deep learning task to be processed J_i is obtained, standardized and filled into the corresponding row of the matrix M_J. Each deep learning task to be processed J_i in the queue is analyzed with a small amount of data, and two features are arbitrarily selected from the features of the deep learning task to be processed to obtain their feature values. CPU resource utilization is analyzed using the virtual file system, i.e. the proc file system, to obtain the CPU metrics, and a performance profiling tool, for example the NVIDIA Profiler, is used to analyze GPU resource utilization to obtain the GPU metrics. The performance metrics obtained from the analysis are filled into the characterization vector J_i of the deep learning task to be processed, and the vector J_i is inserted into the matrix M_J.
The IASP model is used to predict the feature values missing from the matrix M_J, and the features predicted by the IASP model are filled into the matrix M_J to obtain a complete matrix. The feature vector of the deep learning task to be processed is obtained through the IASP model. Similarly, the task information of the tasks in the waiting queue of each compute node of the GPU cluster is input into the IASP model to obtain the feature vectors of the tasks in the waiting queues of the compute nodes of the GPU cluster.
Specifically, the IASP model is a deep collaborative filtering model, i.e. a DCF (Deep Collaborative Filtering) model, which performs the similarity prediction. Further, stochastic gradient descent (SGD) can be used to optimize the DCF model.
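The text specifies a DCF model optimized with SGD but does not give its architecture. The following is a minimal sketch that substitutes a plain matrix-factorization collaborative filter trained with SGD, simply to illustrate how the missing entries of the task-metric matrix M_J could be completed and task feature vectors obtained; the factorization itself, the hyperparameters and all names are assumptions:

```python
import numpy as np

def complete_task_metric_matrix(M, rank=4, lr=0.01, reg=0.1, epochs=200, seed=0):
    """Fill missing entries (np.nan) of a task-metric matrix M by matrix
    factorization trained with SGD -- a simplified stand-in for the IASP/DCF model."""
    rng = np.random.default_rng(seed)
    n_tasks, n_metrics = M.shape
    P = 0.1 * rng.standard_normal((n_tasks, rank))    # latent task factors
    Q = 0.1 * rng.standard_normal((n_metrics, rank))  # latent metric factors
    observed = [(i, j, M[i, j]) for i in range(n_tasks)
                for j in range(n_metrics) if not np.isnan(M[i, j])]
    for _ in range(epochs):
        for idx in rng.permutation(len(observed)):
            i, j, v = observed[idx]
            err = v - P[i] @ Q[j]
            P[i] += lr * (err * Q[j] - reg * P[i])    # SGD update of the task factor
            Q[j] += lr * (err * P[i] - reg * Q[j])    # SGD update of the metric factor
    completed = P @ Q.T
    mask = ~np.isnan(M)
    completed[mask] = M[mask]          # keep the values that were actually measured
    return completed, P                # rows of P can serve as task feature vectors
```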
The similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster is then calculated according to the following formula:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk.
According to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster, the similarity between the deep learning task to be processed and each compute node of the GPU cluster is obtained. For example, the similarity between J_i and S_j is the average of the similarities between J_i and the tasks in S_j; that is, after the similarities between the deep learning task to be processed J_i and the n tasks in S_j of the waiting queue of the j-th compute node of the GPU cluster are computed separately, the average of these similarities is taken to obtain the similarity between the i-th deep learning task to be processed and the waiting queue of the j-th compute node of the GPU cluster.
Because S_j contains n tasks, the similarities between the i-th deep learning task to be processed and each of the n tasks in the waiting queue of the j-th compute node of the GPU cluster need to be computed, and the average of all these similarities is then taken to obtain the similarity between the i-th deep learning task to be processed and the tasks in the waiting queue of the j-th compute node of the GPU cluster.
Step 130, determining, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster.
According to the similarity, the compute node of the GPU cluster with the smallest similarity is taken as the target compute node of the deep learning task to be processed in the GPU cluster.
Selecting the compute node with the smallest similarity avoids contention for computing resources and improves the execution efficiency of the GPU cluster and the utilization of computing resources.
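A minimal sketch of this similarity-based node selection, assuming the cosine similarity and the per-node averaging described above (the function names and data layout are illustrative assumptions, not the patented implementation):

```python
import numpy as np

def cosine_similarity(j_i, s_jk):
    """cos(theta) between the pending task's feature vector J_i and a queued task's S_jk."""
    return float(np.dot(j_i, s_jk) / (np.linalg.norm(j_i) * np.linalg.norm(s_jk)))

def node_similarity(j_i, node_queue):
    """Similarity of J_i to one compute node: the average cosine similarity over
    the n tasks S_j1 .. S_jn waiting in that node's queue."""
    return float(np.mean([cosine_similarity(j_i, s_jk) for s_jk in node_queue]))

def pick_target_node(j_i, queues_by_node):
    """Step 130: choose the compute node whose waiting queue is least similar to J_i,
    which is expected to minimise resource contention."""
    sims = {node: node_similarity(j_i, queue) for node, queue in queues_by_node.items()}
    return min(sims, key=sims.get), sims
```

For instance, with queues_by_node mapping node names to lists of queued-task feature vectors, pick_target_node returns the node with the lowest average similarity together with the per-node scores.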
Step 140, dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed.
After the deep learning task to be processed has been assigned to the target compute node in the GPU cluster, it is divided into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed. For example, for the deep learning task to be processed J_i, the division into multiple target subtasks is defined as J_i = {T_i^1, …, T_i^n}, where n denotes the number of subtasks of the deep learning task to be processed J_i.
A target subtask is defined as T_i^j, j ∈ {1, …, n}, T_i^j ∈ J_i, where n indicates that the deep learning task to be processed is divided into n target subtasks and T_i^j denotes the j-th target subtask.
Step 150, analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask.
The set of GPUs in the target compute node is defined as G_k, k ∈ {1, …, m}, where m indicates that there are m GPUs in the target compute node and G_k is the k-th GPU.
A target subtask is characterized as
T_i^j = (tE_SM, tU_L1, tThP_L1, tU_L2, tThP_L2, tU_Tex, tThP_Tex, tU_DRAM, tThP_DRAM, tThP_L, tThP_S, batch_size, sub_dsize), where tE_SM, tU_L1, tThP_L1, tU_L2, tThP_L2, tU_Tex, tThP_Tex, tU_DRAM, tThP_DRAM, tThP_L and tThP_S respectively denote the SM efficiency, the GPU L1 cache utilization, the L1 read throughput, the GPU L2 cache utilization, the L2 read throughput, the GPU texture cache utilization, the texture cache read throughput, the GPU memory utilization, the memory read throughput, the global load throughput and the global store throughput of the target subtask, and sub_dsize denotes the size of the data set of the target subtask.
In a possible embodiment, analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask, comprises:
mapping the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between the target subtasks and the GPUs;
inputting each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation of the target subtask when executed jointly with each subtask being executed by each GPU;
calculating the interference level of each target subtask according to the performance degradations;
calculating the communication cost of each target subtask according to the available bandwidth between the GPUs and the amount of data required for updating the model.
Performance degradation is defined as the ratio between the completion time of a subtask when it runs together with other subtasks and its completion time when it runs in isolation.
Each target subtask and each subtask being executed by each GPU are input into a preset performance prediction model; the preset performance prediction model may be an IAPP (Interference-Aware Performance Prediction) model, and specifically the preset performance prediction model may be a DNN (Deep Neural Network) model.
For example, the target subtasks are defined as J_i = {T_i^1, …, T_i^n}, j ∈ {1, …, n}, T_i^j ∈ J_i, where n indicates that the deep learning task to be processed is divided into n target subtasks and T_i^j denotes the j-th target subtask, and the set of GPUs in the target compute node is defined as G = {G_1, …, G_m}, where m indicates that there are m GPUs in the target compute node. The multiple target subtasks J_i = {T_i^1, …, T_i^n} are respectively mapped to the GPUs in the target compute node, and the mapping relation between the target subtasks and the GPUs is obtained. The mapping relation is defined as M(j) = k, j ∈ {1, …, n}, k ∈ {1, …, m}, where n indicates that the deep learning task to be processed is divided into n target subtasks, m indicates that there are m GPUs in the target compute node, and M(j) = k means that the j-th target subtask T_i^j is assigned to the k-th GPU, i.e. G_k.
Specifically, a subtask-metric matrix M_t is defined to characterize the subtasks, where a subtask is either a target subtask or a subtask being executed by one of the GPUs.
For each subtask being executed by each GPU, the corresponding vectors of the matrix M_t are input into the IAPP model, whose output is a vector z_t in which each element represents the performance degradation of the target subtask when it is executed jointly with the other subtasks.
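The text only states that the IAPP model may be a DNN; its architecture is not disclosed. Below is a minimal sketch assuming a small multilayer perceptron in PyTorch that maps the concatenated metric vectors of a target subtask and a co-located subtask to a predicted slowdown; the layer sizes, the input dimension (the 13-element subtask characterization above) and all names are assumptions:

```python
import torch
import torch.nn as nn

class IAPPNet(nn.Module):
    """Assumed MLP: metric vectors of a target subtask and a co-located subtask
    -> predicted performance degradation (a slowdown >= 1.0 means the task runs slower)."""
    def __init__(self, metric_dim: int = 13, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * metric_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, target_metrics: torch.Tensor, colocated_metrics: torch.Tensor) -> torch.Tensor:
        x = torch.cat([target_metrics, colocated_metrics], dim=-1)
        return self.net(x).squeeze(-1)

# Usage sketch: slowdown of one target subtask against every subtask running on a GPU.
# model = IAPPNet()
# z_t = model(t_metrics.expand(len(running_metrics), -1), running_metrics)
```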
The average performance degradation of a target subtask T_i^j when it is assigned to a GPU is calculated using the following formula:

Slowdown_{jk} = \frac{1}{num_k} \sum_{l=1}^{num_k} slowdown_{jl}

where Slowdown_jk denotes the average performance degradation when the target subtask T_i^j is assigned to the k-th GPU in the target compute node, slowdown_jl denotes the predicted performance degradation of T_i^j when it is executed jointly with the l-th subtask on that GPU, and num_k denotes the number of subtasks jointly executing on the k-th GPU in the target compute node.
The interference level is defined as

I(M) = \frac{1}{n} \sum_{j=1}^{n} I(M(j)), \qquad I(M(j)) = Slowdown_{j,M(j)}

where I(M(j)) denotes the interference level of the j-th target subtask T_i^j when its mapping relation is M(j), and n indicates that the deep learning task to be processed is divided into n target subtasks. The communication cost from G_i to G_j is defined as cc_ij, i ∈ {1, …, m}, j ∈ {1, …, m}, where G_i denotes the i-th GPU in the target compute node and G_j denotes the j-th GPU in the target compute node. The communication cost can be calculated from the available bandwidth between the physical GPUs, where high available bandwidth implies low communication cost and low available bandwidth implies high communication cost.
The communication demand between target subtasks T_i^i and T_i^j is defined as cr_ij, and the communication demand is defined as the amount of data required for updating the model. The communication cost of the mapping is obtained according to the following formula:

C(M) = \sum_{i=1}^{n} \sum_{j=1}^{n} cr_{ij} \cdot cc_{M(i)M(j)}

where C(M) denotes the communication cost when the mapping relation is M.
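A minimal sketch of these two quantities, under the assumptions that the interference of a mapping is the mean predicted slowdown of the subtasks on their assigned GPUs and that the communication cost sums the demand cr_ij weighted by the GPU-to-GPU cost cc over subtask pairs (the exact aggregation and all function names are assumptions):

```python
import numpy as np

def interference(slowdown, mapping):
    """I(M): mean of Slowdown[j, M(j)] over the n target subtasks, where
    slowdown[j, k] is the IAPP-predicted average slowdown of subtask j on GPU k."""
    n = len(mapping)
    return float(np.mean([slowdown[j, mapping[j]] for j in range(n)]))

def communication_cost(cr, cc, mapping):
    """C(M): communication demand cr[i, j] between subtasks i and j, weighted by the
    cost cc[M(i), M(j)] of the link between the GPUs they are mapped to."""
    n = len(mapping)
    return float(sum(cr[i, j] * cc[mapping[i], mapping[j]]
                     for i in range(n) for j in range(n) if i != j))

def objective(slowdown, cr, cc, mapping, alpha=0.5, beta=0.5):
    """F(M) = alpha * I(M) + beta * C(M), with alpha + beta = 1; lower is better."""
    return alpha * interference(slowdown, mapping) + beta * communication_cost(cr, cc, mapping)
```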
Step 160, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
By considering the interference level and communication cost of the target subtasks, unbalanced resource allocation across the GPUs of the compute node is avoided, a high degree of parallelism for deep learning tasks is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of deep learning tasks is improved at the same time.
In a possible embodiment, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node comprises:
determining the objective function of the target GPU, wherein the objective function is:

F(M) = \alpha \, I(M) + \beta \, C(M)

where F(M) is the objective function value, I(M) denotes the interference level when the mapping relation is M, C(M) denotes the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
By considering the interference level and communication cost of the target subtasks, unbalanced resource allocation across the GPUs of the compute node is avoided, a high degree of parallelism for deep learning tasks is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of deep learning tasks is improved at the same time.
In a possible embodiment, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node comprises:
determining the target GPU of each target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
For example, with the particle swarm optimization algorithm, a group of particles is first initialized, where a particle represents the target subtasks to be allocated. The particle positions are then randomly initialized from the GPU set G = {G_1, …, G_m} and the particle velocities are randomly initialized; the value assigned to each dimension of a particle is the index of a GPU, for example k for the k-th GPU, i.e. a particle represents a mapping of multiple subtasks to GPUs.
Each particle is represented by its current position x and current velocity v, and each particle knows both the best position pbest it has found so far and the global best position gbest found so far by the entire swarm. The principle of the algorithm is to move these particles in order to find the optimal solution. The position of each particle is influenced by its own best position pbest and by the global best position gbest, and each particle updates its best position pbest using the fitness value calculated by the fitness function in each generation. In each iteration, each particle updates its velocity and position using the following formulas:

v^{t+1} = \omega v^{t} + c_1 r_1 (pbest - x^{t}) + c_2 r_2 (gbest - x^{t})
x^{t+1} = x^{t} + v^{t+1}

where ω is the inertia weight of the particle, c_1 and c_2 are acceleration coefficients, and r_1 and r_2 are random numbers between 0 and 1.
The fitness of each particle is evaluated using the objective function F(M) = αI(M) + βC(M), and the velocity and position of each particle are then updated according to the two formulas above.
The iterations are performed until the specified number of iterations is reached or the required iteration precision is met, and the optimal GPU mapping is found.
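A minimal discrete sketch of this search (not the patented implementation): particle positions are vectors of GPU indices, the canonical velocity and position updates above are applied and then rounded and clipped back to valid indices, and the fitness is the objective F(M) built, for example, from the interference and communication-cost helpers sketched earlier; all names and parameter values are assumptions:

```python
import numpy as np

def pso_map_subtasks(fitness, n_subtasks, n_gpus, n_particles=30, iters=100,
                     w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a subtask -> GPU mapping that minimises fitness(mapping),
    e.g. fitness = lambda M: objective(slowdown, cr, cc, M, alpha, beta)."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, n_gpus, size=(n_particles, n_subtasks)).astype(float)  # positions
    v = rng.uniform(-1.0, 1.0, size=(n_particles, n_subtasks))                 # velocities
    pbest = x.copy()
    pbest_val = np.array([fitness(p.astype(int)) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()                                   # global best position
    gbest_val = pbest_val.min()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)              # velocity update
        x = np.clip(np.rint(x + v), 0, n_gpus - 1)                             # position update, kept discrete
        vals = np.array([fitness(p.astype(int)) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        if vals.min() < gbest_val:
            gbest, gbest_val = x[vals.argmin()].copy(), vals.min()
    return gbest.astype(int), gbest_val
```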
Determining the target GPU of each target subtask in the target compute node according to the preset optimization algorithm, the interference level and the communication cost maximizes the improvement of the resource utilization of the GPU cluster while also maximizing the improvement of the execution efficiency of deep learning tasks.
By first analyzing the similarity between the deep learning task to be processed and each compute node of the GPU cluster, the target compute node of the deep learning task to be processed in the GPU cluster is determined. By fully considering the similarity between the deep learning task to be processed and the other tasks, the likelihood of resource contention on the compute nodes is reduced, thereby improving system resource utilization and the execution efficiency of deep learning tasks. The deep learning task to be processed is then divided into multiple target subtasks according to the number of GPUs it requires, and the interference level and communication cost of each target subtask are analyzed to determine the target GPU of each target subtask within the target compute node. By considering the interference level and communication cost of the target subtasks, unbalanced resource allocation across the GPUs of the compute node is avoided, a high degree of parallelism for deep learning tasks is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of deep learning tasks is improved at the same time.
An embodiment of the present application further provides an apparatus. Referring to Fig. 2, which is a schematic diagram of a GPU cluster deep learning task parallelization apparatus according to an embodiment of the present application, the apparatus includes:
an acquisition module 210, configured to obtain a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
a first analysis module 220, configured to analyze the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
a compute node determination module 230, configured to determine, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
a subtask module 240, configured to divide the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
a second analysis module 250, configured to analyze the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
a GPU determination module 260, configured to determine, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
In a possible embodiment, the task information of the deep learning task to be processed comprises at least one of: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
In a possible embodiment, the first analysis module 220 is specifically configured to:
input the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculate, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk;
obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
In a possible embodiment, the second analysis module 250 is specifically configured to:
map the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between the target subtasks and the GPUs;
input each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation of the target subtask when executed jointly with each subtask being executed by each GPU;
calculate the interference level of each target subtask according to the performance degradations;
calculate the communication cost of each target subtask according to the available bandwidth between the GPUs and the amount of data required for updating the model.
In a possible embodiment, the second analysis module 250 is specifically configured to:
determine the objective function of the target GPU, wherein the objective function is:

F(M) = \alpha \, I(M) + \beta \, C(M)

where F(M) is the objective function value, I(M) denotes the interference level when the mapping relation is M, C(M) denotes the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
In a possible embodiment, the second analysis module 250 is specifically configured to:
determine the target GPU of each target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
An embodiment of the present application further provides an electronic device. Referring to Fig. 3, the electronic device comprises a processor 310, a communication interface 320, a memory 330 and a communication bus 340, where the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340;
the memory 330 is configured to store a computer program;
the processor 310 is configured to implement the following steps when executing the computer program stored in the memory 330:
obtaining a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
determining, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
For example, the processor 310 of electronic equipment includes the GPU cluster that centralized control unit and multiple GPU are formed, wherein GPU cluster includes multiple calculate nodes, and each calculate node is made of multiple GPU, centralized control unit include data collector, Cluster management unit, node management unit, electronic equipment are used for multiple deep learning task parallel processings in GPU cluster.Its In, data collector obtains the task-set information of deep learning task and GPU cluster to be processed, and cluster management unit analysis is above-mentioned The task letter of each task in the mission bit stream of deep learning task to be processed and each calculate node waiting list of above-mentioned GPU cluster Breath, respectively obtains the similitude of above-mentioned deep learning task to be processed and each calculate node of above-mentioned GPU cluster, and according to above-mentioned phase Like property, determine target computing nodes of the above-mentioned deep learning task to be processed in above-mentioned GPU cluster, node management unit according to Above-mentioned deep learning task to be processed is divided into multiple target and appointed by the GPU number that above-mentioned deep learning task to be processed needs Business, then analyzes each son that each GPU in the mission bit stream and above-mentioned target computing nodes of above-mentioned target subtask is carrying out The mission bit stream of task respectively obtains the disturbance level and communication price of each above-mentioned target subtask, finally according to each above-mentioned mesh The disturbance level of mark subtask and above-mentioned communication price, determine each above-mentioned target respectively in the GPU of above-mentioned target computing nodes The execution GPU of subtask.
Optionally, when executing the program stored in the memory 330, the processor 310 may also implement any of the above GPU cluster deep learning task parallelization methods.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include a random access memory (RAM), or may include a non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a graphics processing unit (GPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present application further provides a storage medium in which instructions are stored; when the instructions run on a computer, the computer is caused to execute any of the GPU cluster deep learning task parallelization methods in the above embodiments.
An embodiment of the present application further provides a computer-readable storage medium in which instructions are stored; when the instructions run on a computer, the computer is caused to execute any of the above GPU cluster deep learning task parallelization methods in the above embodiments.
It should be noted that, in this document, the technical features of the optional solutions may be combined to form further solutions as long as they do not contradict each other, and all such solutions fall within the scope disclosed in the present application. Relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
Each embodiment in this specification is described in a related manner; the same or similar parts between the embodiments may refer to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the embodiments of the device, the electronic device and the storage medium are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, reference may be made to the description of the method embodiments.
The above are only preferred embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall be included within its scope of protection.

Claims (10)

1. A GPU cluster deep learning task parallelization method, characterized by comprising:
obtaining a deep learning task to be processed and task-set information of a GPU cluster, wherein the task-set information of the GPU cluster includes the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
determining, according to the similarity, a target compute node for the deep learning task to be processed in the GPU cluster;
dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and the communication cost of each target subtask;
determining, according to the interference level and the communication cost of each target subtask, an execution GPU for each target subtask among the GPUs of the target compute node.
2. The method according to claim 1, characterized in that the task information of the deep learning task to be processed includes: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed per step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
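As an illustration of how the task information listed in claim 2 might be organized in practice, the sketch below collects the listed metrics into a single record and flattens it into the feature vector consumed by the similarity computation of claim 3. Field names and units are assumptions, not terms from the patent.

# A possible container for the task information listed in claim 2.

from dataclasses import dataclass, astuple
from typing import List


@dataclass
class TaskInfo:
    max_cpu_usage: float       # peak CPU utilization of the task, 0.0-1.0
    host_mem_usage: float      # host memory utilization
    io_throughput: float       # I/O throughput, e.g. MB/s
    gpu_utilization: float     # GPU utilization
    device_mem_usage: float    # GPU (device) memory utilization
    bandwidth_usage: float     # bandwidth utilization
    samples_per_step: int      # number of samples analyzed per training step
    dataset_size: float        # size of the training data set, e.g. GB

    def as_feature_vector(self) -> List[float]:
        # Flattened form consumed by the similarity computation of claim 3.
        return [float(x) for x in astuple(self)]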
3. The method according to claim 1, characterized in that analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster, comprises:
inputting the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculating, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:
cos(θ_JiSjk) = (Ji · Sjk) / (||Ji|| × ||Sjk||)
wherein Ji represents the feature vector of the i-th deep learning task to be processed; Sj represents the set of tasks in the waiting queue of the j-th compute node of the GPU cluster, and Sj contains n tasks; Sjk represents the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, ..., n}; θ_JiSjk is the angle between Ji and Sjk; cos(θ_JiSjk) represents the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster; ||Ji|| is the norm of the vector Ji, and ||Sjk|| is the norm of the vector Sjk;
obtaining the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
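The cosine-similarity step of claim 3 can be sketched as follows. The feature vectors are assumed to have already been produced by the similarity prediction model, and the mean over queued tasks is only one possible way to aggregate per-task similarities into a per-node score, since claim 3 does not fix the aggregation.

# Cosine similarity between the feature vector of the task to be processed
# and the feature vector of each queued task, then an assumed mean
# aggregation per compute node.

import math
from typing import List, Sequence


def cosine_similarity(ji: Sequence[float], sjk: Sequence[float]) -> float:
    # cos(theta_JiSjk) = (Ji . Sjk) / (||Ji|| * ||Sjk||)
    dot = sum(a * b for a, b in zip(ji, sjk))
    norm_ji = math.sqrt(sum(a * a for a in ji))
    norm_sjk = math.sqrt(sum(b * b for b in sjk))
    if norm_ji == 0.0 or norm_sjk == 0.0:
        return 0.0
    return dot / (norm_ji * norm_sjk)


def node_similarity(ji: Sequence[float], waiting_queue: List[Sequence[float]]) -> float:
    # Per-node similarity derived from the per-task similarities; the mean
    # is an assumed aggregation.
    if not waiting_queue:
        return 0.0
    sims = [cosine_similarity(ji, sjk) for sjk in waiting_queue]
    return sum(sims) / len(sims)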
4. The method according to claim 1, characterized in that analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and the communication cost of each target subtask, comprises:
mapping the multiple target subtasks to the GPUs in the target compute node respectively, to obtain a mapping relation between the target subtasks and the GPUs;
inputting each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation when the target subtask is executed together with each subtask being executed by each GPU;
calculating the interference level of each target subtask according to each performance degradation;
calculating the communication cost of each target subtask according to the available bandwidth between the GPUs and the data volume required to update the model.
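A hedged sketch of the two estimates in claim 4 is given below for one candidate mapping. The preset performance prediction model is represented by a caller-supplied predict_degradation function; summing degradations per GPU and dividing the model-update volume by pairwise available bandwidth are illustrative choices, and none of the names come from the patent.

# Hedged sketch of the claim 4 estimates for one candidate mapping.

from typing import Callable, Dict, List


def interference_level(mapping: Dict[str, int],
                       running: Dict[int, List[str]],
                       predict_degradation: Callable[[str, str], float]) -> float:
    # Predicted slow-down of each target subtask when co-located with the
    # subtasks already executing on its assigned GPU.
    total = 0.0
    for subtask, gpu_id in mapping.items():
        for other in running.get(gpu_id, []):
            total += predict_degradation(subtask, other)
    return total


def communication_cost(mapping: Dict[str, int],
                       update_volume_mb: float,
                       bandwidth_mb_s: Dict[frozenset, float]) -> float:
    # Time to exchange model updates between every pair of assigned GPUs,
    # given the available bandwidth between them.
    gpus = sorted(set(mapping.values()))
    cost = 0.0
    for i in range(len(gpus)):
        for j in range(i + 1, len(gpus)):
            link = frozenset((gpus[i], gpus[j]))
            cost += update_volume_mb / bandwidth_mb_s.get(link, 1.0)
    return cost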
5. The method according to claim 1, characterized in that determining, according to the interference level and the communication cost of each target subtask, an execution GPU for each target subtask among the GPUs of the target compute node, comprises:
determining the objective function of the target GPU, wherein the objective function is:
objective function value = α·I(M) + β·C(M)
wherein I(M) represents the interference level when the mapping relation is M, C(M) represents the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when the objective function value is at its minimum, the GPUs corresponding to the mapping relation M are the target GPUs.
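Reading claim 5's objective as the weighted sum α·I(M) + β·C(M), it can be evaluated directly once the interference level and the communication cost of each candidate mapping are known. The sketch below assumes both terms have already been normalized to comparable scales; the default weight of 0.5 is illustrative.

# Direct evaluation of the weighted objective from claim 5.

from typing import Dict, Iterable, Tuple


def objective(interference: float, comm_cost: float, alpha: float = 0.5) -> float:
    beta = 1.0 - alpha                      # enforces alpha + beta = 1
    return alpha * interference + beta * comm_cost


def best_mapping(candidates: Iterable[Tuple[Dict[str, int], float, float]]) -> Dict[str, int]:
    # candidates: (mapping M, I(M), C(M)) triples; the minimizer is the
    # mapping whose GPUs become the target GPUs.
    mapping, _, _ = min(candidates, key=lambda c: objective(c[1], c[2]))
    return mapping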
6. The method according to any one of claims 1-5, characterized in that determining, according to the interference level and the communication cost of each target subtask, an execution GPU for each target subtask among the GPUs of the target compute node, comprises:
determining the target GPU of the target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
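Claim 6 permits any of several metaheuristics; as one example, the sketch below uses simulated annealing over subtask-to-GPU mappings to minimize a caller-supplied cost (for instance the weighted objective above). The linear cooling schedule, the single-subtask neighbour move, and all parameter defaults are textbook choices, not values taken from the patent.

# Simulated annealing over subtask-to-GPU mappings, as one of the
# metaheuristics named in claim 6.

import math
import random
from typing import Callable, Dict, List


def anneal(subtasks: List[str], gpu_ids: List[int],
           cost: Callable[[Dict[str, int]], float],
           steps: int = 2000, t0: float = 1.0, t_min: float = 1e-3) -> Dict[str, int]:
    current = {s: random.choice(gpu_ids) for s in subtasks}   # random start
    cur_cost = cost(current)
    best, best_cost = dict(current), cur_cost
    for step in range(steps):
        t = max(t_min, t0 * (1.0 - step / steps))             # linear cooling
        neighbour = dict(current)
        neighbour[random.choice(subtasks)] = random.choice(gpu_ids)  # move one subtask
        new_cost = cost(neighbour)
        # Always accept improvements; accept worse mappings with Boltzmann probability.
        if new_cost < cur_cost or random.random() < math.exp((cur_cost - new_cost) / t):
            current, cur_cost = neighbour, new_cost
            if cur_cost < best_cost:
                best, best_cost = dict(current), cur_cost
    return best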
7. A GPU cluster deep learning task parallelization device, characterized by comprising:
an acquisition module, configured to obtain a deep learning task to be processed and task-set information of a GPU cluster, wherein the task-set information of the GPU cluster includes the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
a first analysis module, configured to analyze the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
a compute node determining module, configured to determine, according to the similarity, a target compute node for the deep learning task to be processed in the GPU cluster;
a subtask module, configured to divide the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
a second analysis module, configured to analyze the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and the communication cost of each target subtask;
a GPU determining module, configured to determine, according to the interference level and the communication cost of each target subtask, an execution GPU for each target subtask among the GPUs of the target compute node.
8. The device according to claim 7, characterized in that the task information of the deep learning task to be processed includes at least one of: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed per step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
9. An electronic device, characterized by comprising: a processor, a communication interface, a memory and a communication bus, wherein
the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
the processor, when executing the program stored in the memory, implements the GPU cluster deep learning task parallelization method according to any one of claims 1-6.
10. A storage medium, characterized in that a computer program is stored in the storage medium, and when executed by a processor the computer program implements the GPU cluster deep learning task parallelization method according to any one of claims 1-6.
CN201910675587.9A 2019-07-25 2019-07-25 GPU cluster deep learning task parallelization method and device and electronic equipment Active CN110399222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910675587.9A CN110399222B (en) 2019-07-25 2019-07-25 GPU cluster deep learning task parallelization method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110399222A true CN110399222A (en) 2019-11-01
CN110399222B CN110399222B (en) 2022-01-21

Family

ID=68325235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910675587.9A Active CN110399222B (en) 2019-07-25 2019-07-25 GPU cluster deep learning task parallelization method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110399222B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125369A1 (en) * 2003-12-09 2005-06-09 Microsoft Corporation System and method for accelerating and optimizing the processing of machine learning techniques using a graphics processing unit
US20140201741A1 (en) * 2010-07-26 2014-07-17 Microsoft Corporation Workload interference estimation and performance optimization
US20150301862A1 (en) * 2012-09-14 2015-10-22 International Business Machines Corporation Preferential cpu utilization for tasks
CN105900064A (en) * 2014-11-19 2016-08-24 华为技术有限公司 Method and apparatus for scheduling data flow task
CN107045456A (en) * 2016-02-05 2017-08-15 华为技术有限公司 A kind of resource allocation methods and explorer
CN107135257A (en) * 2017-04-28 2017-09-05 东方网力科技股份有限公司 Task is distributed in a kind of node cluster method, node and system
CN107329828A (en) * 2017-06-26 2017-11-07 华中科技大学 A kind of data flow programmed method and system towards CPU/GPU isomeric groups
CN107766148A (en) * 2017-08-31 2018-03-06 北京百度网讯科技有限公司 A kind of isomeric group and task processing method and device
CN109936604A (en) * 2017-12-18 2019-06-25 北京图森未来科技有限公司 A kind of resource regulating method, device and system
CN109101339A (en) * 2018-08-15 2018-12-28 北京邮电大学 Video task parallel method, device and Heterogeneous Cluster Environment in isomeric group

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GEORGE TEODORO ET.AL: ""Approximate Similarity Search for Online Multimedia Services on Distributed CPU-GPU Platforms"", 《ARXIV》 *
HAITAO ZHANG ET.AL: ""Learning Driven Parallelization for Large-Scale Video Workload in Hybrid CPU-GPU Cluster"", 《ICPP 2018》 *
JEON, M. ET.AL: ""Multi-tenant GPU clusters for deep learning workloads"", 《TECHNICAL REPORT, MSR-TR-2018》 *
WEI QIAO ET.AL: ""DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment"", 《ITM WEB OF CONFERENCES》 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112965809A (en) * 2019-12-12 2021-06-15 深圳市优必选科技股份有限公司 Deep learning task processing system and method
CN111104289A (en) * 2019-12-25 2020-05-05 创新奇智(上海)科技有限公司 System and method for checking efficiency of GPU (graphics processing Unit) cluster
CN111258735A (en) * 2020-01-16 2020-06-09 中国人民解放军国防科技大学 Deep learning task scheduling method supporting QoS (quality of service) perception of user
CN111309479A (en) * 2020-02-14 2020-06-19 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing task parallel processing
US11954522B2 (en) 2020-02-14 2024-04-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Method for processing tasks in parallel, device and storage medium
CN111866187A (en) * 2020-06-30 2020-10-30 中科院计算所西部高等技术研究院 Task scheduling method of distributed deep learning reasoning cloud platform
CN111866187B (en) * 2020-06-30 2022-10-04 中科院计算所西部高等技术研究院 Task scheduling method for distributed deep learning reasoning cloud platform
CN111913799B (en) * 2020-07-14 2024-04-19 北京华夏启信科技有限公司 Video stream online analysis task scheduling method and computer equipment
CN111913799A (en) * 2020-07-14 2020-11-10 北京华夏启信科技有限公司 Video stream online analysis task scheduling method and computer equipment
CN112416585B (en) * 2020-11-20 2024-03-15 南京大学 Deep learning-oriented GPU resource management and intelligent scheduling method
CN112416585A (en) * 2020-11-20 2021-02-26 南京大学 GPU resource management and intelligent scheduling method for deep learning
CN112584143A (en) * 2020-12-02 2021-03-30 浙江大华技术股份有限公司 Video coding method, device and system and computer readable storage medium
CN112584143B (en) * 2020-12-02 2022-09-06 浙江大华技术股份有限公司 Video coding method, device and system and computer readable storage medium
WO2022116142A1 (en) * 2020-12-04 2022-06-09 深圳大学 Resource scheduling method based on graph neural network
CN113194086B (en) * 2021-04-27 2022-05-27 新华三信息安全技术有限公司 Anti-attack method and device
CN113194086A (en) * 2021-04-27 2021-07-30 新华三信息安全技术有限公司 Anti-attack method and device
CN113377520A (en) * 2021-07-07 2021-09-10 北京百度网讯科技有限公司 Resource scheduling method, device, equipment and storage medium
CN113900793A (en) * 2021-07-29 2022-01-07 苏州浪潮智能科技有限公司 Server cluster and deep learning aggregate communication system and method thereof
CN113900793B (en) * 2021-07-29 2023-11-10 苏州浪潮智能科技有限公司 Server cluster and deep learning aggregate communication system and method thereof
CN114285766A (en) * 2021-08-20 2022-04-05 腾讯科技(深圳)有限公司 Network bandwidth detection method and device, electronic equipment and storage medium
CN114285766B (en) * 2021-08-20 2023-06-13 腾讯科技(深圳)有限公司 Network bandwidth detection method and device, electronic equipment and storage medium
CN114116220A (en) * 2021-11-29 2022-03-01 苏州浪潮智能科技有限公司 GPU (graphics processing Unit) sharing control method, GPU sharing control device and storage medium
CN114138449A (en) * 2021-12-14 2022-03-04 河南省儿童医院郑州儿童医院 Rehabilitation training system based on virtual reality
WO2024022046A1 (en) * 2022-07-28 2024-02-01 华为技术有限公司 Deep learning system and method
CN115248728A (en) * 2022-09-21 2022-10-28 之江实验室 Distributed training task scheduling method, system and device for intelligent computing
CN115373861A (en) * 2022-10-26 2022-11-22 小米汽车科技有限公司 GPU resource scheduling method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110399222B (en) 2022-01-21

Similar Documents

Publication Publication Date Title
CN110399222A (en) GPU cluster deep learning task parallel method, device and electronic equipment
Mapetu et al. Low-time complexity and low-cost binary particle swarm optimization algorithm for task scheduling and load balancing in cloud computing
CN1956457B (en) Method and apparatus for arranging mesh work in mesh computing system
CN110389820B (en) Private cloud task scheduling method for resource prediction based on v-TGRU model
US9239734B2 (en) Scheduling method and system, computing grid, and corresponding computer-program product
CN105373432B (en) A kind of cloud computing resource scheduling method based on virtual resource status predication
Abdel‐Basset et al. IEGA: an improved elitism‐based genetic algorithm for task scheduling problem in fog computing
You et al. Comprehensive workload analysis and modeling of a petascale supercomputer
Chen et al. Scheduling independent tasks in cloud environment based on modified differential evolution
Li et al. An effective scheduling strategy based on hypergraph partition in geographically distributed datacenters
Rani et al. An efficient and scalable hybrid task scheduling approach for cloud environment
CN108427602B (en) Distributed computing task cooperative scheduling method and device
Ding et al. Kubernetes-oriented microservice placement with dynamic resource allocation
CN115168027A (en) Calculation power resource measurement method based on deep reinforcement learning
CN113553160A (en) Task scheduling method and system for edge computing node of artificial intelligence Internet of things
Hu et al. Improved heuristic job scheduling method to enhance throughput for big data analytics
Xilin et al. Resource allocation optimization of equipment development task based on MOPSO algorithm
CN112000460A (en) Service capacity expansion method based on improved Bayesian algorithm and related equipment
CN117349026B (en) Distributed computing power scheduling system for AIGC model training
Ghafari et al. E-AVOA-TS: Enhanced African vultures optimization algorithm-based task scheduling strategy for fog–cloud computing
Li et al. Dynamic data replacement and adaptive scheduling policies in spark
Zhou et al. Stability property of clouds and cooperative scheduling policies on multiple types of resources in cloud computing
CN116225708A (en) GPU resource scheduling method and device
Yassir et al. Graph-based model and algorithm for minimising big data movement in a cloud environment
Yu [Retracted] Research on Optimization Strategy of Task Scheduling Software Based on Genetic Algorithm in Cloud Computing Environment

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant