CN107168782A - A parallel computing system based on Spark and GPU - Google Patents
A parallel computing system based on Spark and GPU
- Publication number
- CN107168782A (application number CN201710270400.8A)
- Authority
- CN
- China
- Prior art keywords
- gpu
- resource
- stage
- task
- spark
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F9/4881 — Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G06F15/17318 — Parallel communications techniques, e.g. gather, scatter, reduce, broadcast, multicast, all-to-all
- G06F15/17331 — Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
- G06F9/5038 — Allocation of resources to service a request, the resource being a machine, considering the execution order of a plurality of tasks, e.g. taking priority or time-dependency constraints into consideration
- G06F9/505 — Allocation of resources to service a request, the resource being a machine, considering the load
- G06F9/5083 — Techniques for rebalancing the load in a distributed system
Abstract
The invention belongs to the field of parallel computing, and specifically provides a parallel computing framework system based on Spark and GPU. The invention builds on the YARN resource management platform: by improving its resource manager and node manager, GPU resources in a heterogeneous cluster can be effectively perceived, supporting the management and scheduling of cluster GPU resources. Under the YARN deployment mode, Spark's job scheduling and task execution mechanisms are then improved so that they support the scheduling and execution of GPU-type tasks. By introducing GPU-resource tags at stages such as resource application, resource allocation, DAG generation, stage division and task execution, the execution engine can perceive GPU tasks and execute them effectively in a heterogeneous cluster. At the same time, combining Spark's own efficient in-memory computing with the advantages of GPU many-core parallel computing, an effective programming model under this framework is proposed. The invention can effectively process both data-intensive and computation-intensive jobs, greatly improving job-processing efficiency.
Description
Technical field
The invention belongs to the field of parallel computing, and in particular relates to a parallel computing framework system based on Spark and GPU.
Background art
In today's society, the scale of data to be processed in every industry is growing massively, and big data has attracted wide attention across all sectors. Big data undoubtedly contains a wealth of useful information; if it can be mined and used properly, it will greatly advance scientific research and economic development. Because the information it contains can support business decisions and scientific research, big data technology has developed rapidly and been applied in many industries. In the big-data era everything is data-centric: mining and analyzing massive historical data can yield effective information that cannot be obtained in any other way, improving the accuracy of decision-making.
The development of distributed computing provides an effective means of fully exploiting the value of data. Distributed computing can use clusters of inexpensive computers to analyze massive data quickly, effectively reducing the cost of data analysis. In this environment a batch of distributed computing frameworks has emerged. Among them, Spark, thanks to its in-memory computing model, can markedly improve the efficiency of data processing, and it is widely used in fields such as machine learning and interactive analysis.
At the same time, because a GPU possesses a large number of cores, it can achieve higher computational efficiency than a CPU alone in many applications, and this speed-up is often measured in factors of tens or hundreds. Compared with simply upgrading CPUs, using GPUs for parallel computation is often cheaper and more effective, which gives the GPU an important position in high-performance computing.
Although Spark can effectively process data-intensive jobs, it is less suitable for computation-intensive jobs. Cluster scale is also limited: with CPUs alone, the processing performance for large-batch jobs still leaves much room for improvement. If support for GPU devices were introduced into Spark, it could combine Spark's own efficient in-memory computing with the advantages of GPU many-core parallel computing, significantly improving the efficiency of processing massive data.
The native Spark framework provides no support for GPU devices. The existing way to invoke GPU acceleration in Spark is to call C/C++ programs from Java/Scala code, which has many drawbacks. Because Spark cannot perceive GPU computing tasks, it cannot distinguish CPU tasks from GPU tasks; when scheduling, it may start a GPU task on a node without a GPU device, causing the task to fail. Moreover, the YARN resource manager supports scheduling only CPU and memory resources; it cannot perceive GPU resources and therefore cannot provide GPU allocation and scheduling to the Spark framework above it. Owing to these limitations of YARN and Spark themselves, the traditional way of performing GPU computation in Spark cannot adapt to a heterogeneous cluster environment.
Summary of the invention
The object of the present invention is to provide a Spark- and GPU-based parallel computing system that processes jobs efficiently and can adapt to a heterogeneous cluster environment.
The parallel computing system based on Spark and GPU provided by the present invention integrates Spark with GPUs; it can effectively process both data-intensive and computation-intensive jobs, greatly improving job-processing efficiency.
The parallel computing framework system based on Spark and GPU provided by the present invention includes:
Component 1: an improved resource management platform that supports scheduling and management of multi-dimensional resources such as GPU, CPU and memory;
Component 2: an improved Spark distributed computing framework that supports scheduling and execution of GPU-type tasks.
(1) The improved resource management platform:
YARN's resource manager and node manager are improved so that GPU resources in a heterogeneous cluster can be effectively perceived, supporting the management and scheduling of cluster GPU resources. The improvements cover the resource representation model, the resource scheduling model, the resource preemption model, the resource isolation mechanism and the GPU device binding mechanism.
(2) The improved Spark distributed computing framework:
Spark's resource application and allocation mechanism, job scheduling mechanism and task execution mechanism are improved to support the scheduling and execution of GPU-type tasks. By introducing GPU-resource tags at stages such as resource application, resource allocation, DAG generation, stage division and task execution, the execution engine can perceive GPU tasks and execute them effectively in a heterogeneous cluster.
In the present invention, the improved resource management platform can manage and schedule multi-dimensional resources including GPU resources. Specifically:
For the resource representation model, the number of GPU devices contained in a node is first made configurable, and the resource representation protocol is modified to add a representation of GPU resources. When a node starts, the node manager initializes its resource list and reports the node's resource information to the resource manager through the heartbeat mechanism.
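The extended resource vector described above can be sketched as follows. This is an illustrative Python sketch by the editor, not the patent's implementation; the class and field names are assumptions.

```python
# Sketch: extending the resource representation with a GPU dimension.
from dataclasses import dataclass

@dataclass
class Resource:
    cpu: int   # virtual cores
    gpu: int   # GPU devices (the added dimension)
    mem: int   # memory in MB

    def fits_in(self, other: "Resource") -> bool:
        # A request fits only if every dimension, GPU included, fits.
        return (self.cpu <= other.cpu
                and self.gpu <= other.gpu
                and self.mem <= other.mem)

# On node start, the node manager would build such a vector for its total
# resources and report it to the resource manager in each heartbeat.
node_total = Resource(cpu=16, gpu=2, mem=65536)
request = Resource(cpu=4, gpu=1, mem=8192)
print(request.fits_in(node_total))  # True
```

A scheduler that checks only CPU and memory would wrongly place a 1-GPU request on a GPU-less node; the extra dimension makes that check explicit.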
For the resource scheduling model, the present invention adds GPU, together with CPU and memory resources, to the hierarchical management queues of the resource management platform. This not only keeps resource management uniform, but also allows more flexible permission settings for GPU resources, making it better suited to large clusters handling multi-user workloads. The present invention modifies the resource scheduling module according to the DRF (Dominant Resource Fairness) algorithm, adding scheduling and management of GPU resources. The algorithm is as follows:
(1) Initialize variables. Here R = <totalCPU, totalGPU, totalMem> denotes the total CPU, GPU and memory resources of the cluster; C = <usedCPU, usedGPU, usedMem> denotes the amounts of CPU, GPU and memory already consumed in the cluster; s_i denotes the share of job i's dominant resource in the corresponding total resource; U_i = <CPU_i, GPU_i, Mem_i> denotes the resources already allocated to job i; and D_i = <CPU_i, GPU_i, Mem_i> denotes the resources each task of job i needs.
Each time a job is chosen for resource allocation, the following steps are performed in turn:
(2) Choose the job with the smallest dominant-resource share s_i for execution.
(3) If C + D_i ≤ R, allocate the resources to job i and update C = C + D_i, U_i = U_i + D_i, s_i = max{U_i / R}. Otherwise the cluster resources cannot meet the demand and allocation stops.
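The DRF loop above can be sketched in a few lines of Python. This is a minimal illustration by the editor, under the assumption that each resource vector is <cpu, gpu, mem>; the job names and demand numbers are invented.

```python
# Sketch of the DRF allocation loop with a GPU dimension added.
def drf_allocate(R, jobs, rounds):
    """R: total <cpu, gpu, mem>; jobs: {name: per-task demand D_i}."""
    C = [0, 0, 0]                      # consumed resources
    U = {j: [0, 0, 0] for j in jobs}   # resources allocated per job
    for _ in range(rounds):
        # dominant share s_i = max_r U_i[r] / R[r]
        s = {j: max(x / t for x, t in zip(u, R)) for j, u in U.items()}
        j = min(jobs, key=lambda k: s[k])   # step (2): smallest s_i
        D = jobs[j]
        if any(c + d > t for c, d, t in zip(C, D, R)):
            break                           # step (3): demand cannot be met
        C = [c + d for c, d in zip(C, D)]   # C = C + D_i
        U[j] = [x + d for x, d in zip(U[j], D)]  # U_i = U_i + D_i
    return U

R = [9, 2, 18]                              # totalCPU, totalGPU, totalMem
jobs = {"A": [1, 0, 4], "B": [3, 1, 1]}     # B is the GPU-hungry job
print(drf_allocate(R, jobs, rounds=10))
```

Because job B's dominant resource is GPU (one of two devices per task), DRF stops favoring B once its GPU share outweighs A's memory share, which is exactly the fairness behavior the modified scheduler aims for.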
For the resource preemption model, the resource scheduler sets, for each queue in the hierarchical queues, an upper and a lower bound on the use of every kind of resource. The resource scheduler allocates the resources of lightly loaded queues to more heavily loaded queues to improve cluster resource utilization. But when a new application is submitted to a lightly loaded queue, the scheduler uses the resource preemption mechanism to reclaim resources occupied by other queues, so that the resources originally belonging to that queue can be allocated back to it. When resource preemption occurs, GPU resources must be released.
This work is delegated to the node manager, and a new releaseGPU method is added to release GPU resources. The resource manager sends the list of resources to be released to the corresponding node manager through the heartbeat mechanism; when the node manager detects that a resource entity to be released contains GPU resources, it calls the releaseGPU method to release them. The resource manager then redistributes the released resources to the relevant queues.
For the resource isolation model, because Cgroups provide good isolation and support isolating GPU resources, the present invention uses a Cgroups-based scheme to isolate GPU resources.
For the GPU device binding mechanism, when the resource entity allocated to a task contains GPU resources, the corresponding node manager must bind a GPU device on the node to that resource entity. If the node has multiple idle GPU resources, one must be selected for allocation. The present invention represents GPU running-state information as a list of <GPU device number, resource entity number> pairs, each entry identifying the correspondence between a GPU device and a resource entity. The node manager initializes this list when the node starts, according to the relevant configuration file and the GPU device information on the node.
When a new task requests GPU resources, the node manager searches the list to find GPU devices in the idle state and assigns them to the task. If multiple GPU resources on the node manager's node are idle, GPU resources are allocated round-robin. Meanwhile, the correspondence between running resource entities and GPU resources is persisted to a database; if the node manager needs to restart, it can read the GPU device allocation information directly from the database, avoiding reallocating node resources.
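The binding table and round-robin selection described above can be sketched as follows. This is a hypothetical Python sketch; the class name, method names and identifiers are the editor's assumptions, not the patent's code, and the database persistence step is omitted.

```python
# Sketch: node manager's <GPU device, resource entity> binding table,
# with idle devices handed out round-robin.
class GpuBinder:
    def __init__(self, device_ids):
        self.bindings = {d: None for d in device_ids}  # device -> entity (None = idle)
        self.devices = device_ids
        self._next = 0                                  # round-robin cursor

    def acquire(self, entity_id):
        # Scan at most one full cycle of devices for an idle one.
        for _ in range(len(self.devices)):
            dev = self.devices[self._next]
            self._next = (self._next + 1) % len(self.devices)
            if self.bindings[dev] is None:
                self.bindings[dev] = entity_id
                return dev
        return None  # no idle GPU: the task must wait

    def release_gpu(self, entity_id):
        # Counterpart of the releaseGPU method added to the node manager.
        for dev, ent in self.bindings.items():
            if ent == entity_id:
                self.bindings[dev] = None

binder = GpuBinder(["gpu0", "gpu1"])
print(binder.acquire("container_01"))  # gpu0
print(binder.acquire("container_02"))  # gpu1
print(binder.acquire("container_03"))  # None, both devices busy
binder.release_gpu("container_01")
print(binder.acquire("container_03"))  # gpu0, freed and rebound
```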
In the present invention, the improved Spark distributed computing framework improves the Spark kernel so that it supports the scheduling and execution of GPU-type tasks. Specifically:
When a job is submitted, if the application controller of the Spark application detects that the application needs GPU resources, the required GPU resources are added to the resource request description during resource application.
Two kinds of Container are applied for: CPU-type Containers and GPU-type Containers. Because a GPU-type task also needs a CPU to handle data processing, data transfer and GPU startup, a GPU-type Container needs, besides one unit of GPU resource, a specified number of CPU cores. When applying for resources, the numbers of Containers of the two types must be determined. Let executorCores denote the number of CPU cores each Container contains, totalCores the number of CPU cores of the application, and GPUNum the amount of GPU resources of the application; then the number of GPU-type Containers is GPUNum, and the number of non-GPU Containers is (totalCores - GPUNum * executorCores) / executorCores. The configured amount of memory is then checked to determine whether the total memory can satisfy the memory needed by all Containers, before further processing. After the resource request is sent, the resource scheduler does not immediately return resources satisfying the request; the Spark application controller must keep communicating with the resource manager through the heartbeat mechanism to probe whether the requested resources have been allocated. After receiving the requested resources, the application controller adds them to the program's internal to-be-allocated resource list, from which they are assigned to specific tasks for execution.
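The container-count arithmetic above can be written as a small checked helper. The parameter names follow the text (executorCores, totalCores, GPUNum); the divisibility check is an assumption added by the editor, since the formula only yields a whole number of containers when totalCores divides evenly.

```python
# Sketch: computing GPU-type and non-GPU container counts.
def container_counts(total_cores, gpu_num, executor_cores):
    gpu_containers = gpu_num                       # one container per requested GPU
    remaining = total_cores - gpu_num * executor_cores
    if remaining < 0 or remaining % executor_cores != 0:
        raise ValueError("totalCores must cover all containers exactly")
    cpu_containers = remaining // executor_cores   # (totalCores - GPUNum*executorCores)/executorCores
    return gpu_containers, cpu_containers

# e.g. 16 cores total, 2 GPUs, 2 cores per executor:
print(container_counts(16, 2, 2))  # (2, 6)
```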
GPU tasks must be identified in the Spark interfaces. The present invention proposes the mapPartitionsGPU operator and mapPartitionsGPURDD for handling GPU tasks.
After Spark's job scheduler DAGScheduler generates the DAG graph and begins dividing stages, a field must be added to identify whether the current stage contains GPU operations. Inside a stage, according to whether the computation methods running on each RDD need GPU resources, the internal RDDs are divided into two kinds: RDDs that need GPU resources and RDDs that do not. If a stage contains an RDD that needs GPU resources, then when allocating resources to the partitions of the RDDs in that stage, enough GPU resources should be allocated, even if possibly only one RDD needs them during computation; otherwise a task might fail during computation because no usable GPU resource is available. To identify whether a stage contains an RDD that needs GPU resources, a field flagGPU is added to the stage; when flagGPU is true, the stage contains an RDD that needs GPU resources. With the flagGPU field set, in the next step of resource allocation, the task manager can recognize the stage and allocate GPU resources for it.
In the present invention, the flow by which the job scheduler DAGScheduler inside Spark identifies stage types is as follows:
(1) After the DAG is generated, stages are divided. When generating a stage, check whether the flagGPU field of any RDD contained in the stage is true; if so, the stage needs GPU resources during execution, and the stage's flagGPU field is marked true. This later serves as the task manager's basis for allocating GPU resources.
(2) The execution engine's stage-submission algorithm is a recursive process: it first submits the last stage in the DAG graph, then checks whether all parent stages of that stage have been submitted, and if so starts executing the task-set corresponding to this stage. If some parent stages have not been submitted, it recursively submits the parent stages and performs the same check. The final result is that stages are executed from front to back according to the DAG graph. The benefit is that when the current stage executes, its input data is guaranteed to be ready; and when partition data in an RDD is lost, the most recently generated partition data can be found by walking the DAG graph backwards and re-executed to recover the lost partition.
(3) After a stage is submitted, the task manager divides the stage into a task-set and applies to the cluster manager for the resources needed for execution. The number of tasks in the task-set equals the number of RDD partitions. The task manager first checks whether the stage's flagGPU field is true; if so it allocates containers that include GPU resources. During container allocation, if multiple containers are available, the choice follows the localization strategy: prefer the local node, then other nodes in the same rack, then nodes in other racks. Tasks are then started on the nodes holding the resources, and intermediate and final task results are stored in the storage system. During this process, if the number of containers containing GPU resources is smaller than the number of GPU-type tasks, the tasks not yet allocated GPU resources must wait; when other tasks finish and GPU resources become idle, they are then allocated.
(4) After a task finishes, its resources are returned. The recovered containers are added to the to-be-allocated list for use by other tasks.
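Step (3) above, matching a GPU-flagged task-set against containers with locality preference, can be sketched as follows. This is an illustrative Python sketch by the editor; the tuple layout, locality labels and names are assumptions, not the patent's data structures.

```python
# Sketch: assigning a GPU-flagged stage's tasks to containers that hold
# a GPU, preferring node-local, then rack-local, then remote containers.
LOCALITY = {"node_local": 0, "rack_local": 1, "any": 2}

def assign(tasks_gpu_flag, num_tasks, containers):
    """containers: list of (id, has_gpu, locality). Returns (assigned, waiting)."""
    # GPU tasks may only use containers that actually include a GPU.
    pool = [c for c in containers if c[1]] if tasks_gpu_flag else list(containers)
    pool.sort(key=lambda c: LOCALITY[c[2]])        # localization strategy
    assigned = {f"task{i}": pool[i][0] for i in range(min(num_tasks, len(pool)))}
    waiting = max(0, num_tasks - len(pool))        # tasks waiting for a free GPU
    return assigned, waiting

containers = [("c1", False, "node_local"),
              ("c2", True, "rack_local"),
              ("c3", True, "node_local")]
# 3 GPU tasks but only 2 GPU containers: one task must wait.
print(assign(True, 3, containers))
```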
Based on the improved framework, the present invention proposes an effective programming model for GPU-type tasks.
In Spark, the data in an RDD consists of several partitions, which are ultimately assigned, in units of partitions, to some node to be computed. According to the granularity at which partition data is executed, GPU computation with Spark can be broadly divided into two kinds:
(1) GPU computation completed in units of partitions: all the data of an RDD partition is put onto the GPU to be computed in parallel, improving execution efficiency;
(2) GPU computation completed in units of single records: the data of an RDD partition is put onto the GPU record by record, accelerating processing in units of single records.
In the improved framework, the newly added mapPartitionsGPU operator can perceive GPU-type tasks and takes partition data as input. The main execution logic of the operator is as follows:
(1) First initialize the GPU device in the method;
(2) Then judge whether the execution granularity for partition data is per-partition or per-record. If it is per-partition, the partition data is transferred into GPU memory using the CUDA API; this process may involve a data-format conversion, turning the partition data in the RDD into a format the GPU can process. The GPU is then invoked to compute the data in parallel, and after computation the output is transferred back to main memory. If the execution granularity is per-record, each record of the partition is processed sequentially: one record's data is copied into GPU memory at a time, the GPU is invoked to compute it in parallel, and after computation the output is copied back to main memory. After all records are processed, the outputs of all records must be converted into a partition collection;
(3) Release the GPU device and return a partition-collection iterator.
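The control flow above can be sketched schematically. This editor's sketch replaces the CUDA transfers and the GPU kernel with a plain placeholder function, so only the per-partition versus per-record granularity is illustrated; all names are assumptions, not the patent's operator implementation.

```python
# Schematic stand-in for the mapPartitionsGPU execution logic.
def gpu_compute(batch):
    return [x * x for x in batch]   # placeholder for a GPU kernel

def map_partitions_gpu(partition, per_partition=True):
    # (1) initialize GPU device (placeholder; real code would use CUDA)
    if per_partition:
        # (2a) copy the whole partition to the device, compute in
        #      parallel, then copy the results back
        out = gpu_compute(list(partition))
    else:
        # (2b) copy one record at a time, then collect the per-record
        #      outputs into a partition collection
        out = []
        for record in partition:
            out.extend(gpu_compute([record]))
    # (3) release GPU device, return an iterator over the partition
    return iter(out)

print(list(map_partitions_gpu([1, 2, 3])))                        # [1, 4, 9]
print(list(map_partitions_gpu([1, 2, 3], per_partition=False)))   # [1, 4, 9]
```

Both granularities produce the same result; the per-partition path amortizes the host-to-device transfer over the whole partition, which is why the text presents it as the higher-throughput option.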
Compared with the prior art, the advantages and effects of the present invention are:
1. The improved resource management platform proposed by the invention can perceive the CPU, memory and GPU resources in a heterogeneous cluster and manage and schedule them effectively;
2. The improved Spark distributed computing framework can effectively distinguish GPU-type tasks, handle resource application and allocation accordingly in stages such as DAG generation and stage division, and correctly schedule and execute GPU-type jobs;
3. The framework proposed by the invention can adapt to heterogeneous environments in which only some cluster nodes possess GPU devices, including single nodes with multiple GPUs; it correctly assigns GPU-type tasks to the nodes in the cluster that contain GPU resources, solving the problem that the traditional way of executing GPU tasks cannot work properly in a heterogeneous cluster environment.
Brief description of the drawings
Fig. 1 shows the allocation and release of GPU devices.
Fig. 2 is the execution flow chart of the improved framework.
Fig. 3 shows the working principle of the mapPartitionsGPU operator.
Embodiment
The technical scheme of the present invention is further illustrated below in conjunction with the accompanying drawings. With reference to Fig. 1 (GPU device allocation and release), the embodiment mainly includes:
1. For resource representation, the number of GPU devices contained in a node is first made configurable, and the resource representation protocol is modified to add a representation of GPU resources. When a node starts, the node manager initializes its resource list and reports the node's resource information to the resource manager through the heartbeat mechanism.
2. For resource scheduling, the present invention adds GPU, together with CPU and memory resources, to the hierarchical management queues of the resource management platform.
3. The resource manager sends the list of resources to be released to the corresponding node manager through the heartbeat mechanism; when the node manager detects that a resource entity to be released contains GPU resources, it calls the releaseGPU method to release them. The resource manager then redistributes the released resources to the relevant queues.
4. For resource isolation, because Cgroups provide good isolation and support isolating GPU resources, the present invention uses a Cgroups-based scheme to isolate GPU resources.
5. For dynamic binding of GPU devices, when the resource entity allocated to a task contains GPU resources, the corresponding node manager must bind a GPU device on the node to that resource entity. The present invention represents GPU running-state information as a list of <GPU device number, resource entity number> pairs; the node manager initializes this list when the node starts, according to the relevant configuration file and the GPU device information on the node. When a new task requests GPU resources, the node manager searches the list to find GPU devices in the idle state and assigns them to the task. If multiple GPU resources on the node manager's node are idle, GPU resources are allocated round-robin. Meanwhile, the correspondence between running resource entities and GPU resources is persisted to a database.
6. The present invention proposes the mapPartitionsGPU operator and mapPartitionsGPURDD for handling GPU tasks. After the DAG graph is generated and stage division begins, a field must be added to identify whether the current stage contains GPU operations.
7. When the task manager divides a stage into a task-set, it first checks whether the stage carries the GPU flag; if so it allocates containers containing GPU resources for it.
8. Tasks that carry the GPU flag are scheduled onto nodes containing GPU devices for execution.
References:
[1] Ali Ghodsi, Matei Zaharia, et al. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. UC Berkeley.
[2] M. Zaharia, M. Chowdhury, T. Das, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. CA, USA: USENIX Association, 2012.
[3] Janki Bhimani, Miriam Leeser, Ningfang Mi. Accelerating K-Means clustering with parallel implementations and GPU computing. 2015 IEEE High Performance Extreme Computing Conference (HPEC), 2015, 1-6.
[4] Huang Chao-Qiang, Yang Shu-Qiang. RDDShare: Reusing Results of Spark RDD. 2016 IEEE First International Conference on Data Science in Cyberspace (DSC), 2016, 290-295.
[5] Jie Zhu, Juanjuan Li. GPU-In-Hadoop: Enabling MapReduce Across Distributed Heterogeneous Platforms. IEEE ICIS 2014, 2014, 1-6.
Claims (6)
1. A parallel computing system based on Spark and GPU, characterized by including:
an improved resource management platform that supports scheduling and management of multi-dimensional resources such as GPU, CPU and memory;
an improved Spark distributed computing framework that supports scheduling and execution of GPU-type tasks;
(1) the improved resource management platform including:
improvements to YARN's resource manager and node manager so that GPU resources in a heterogeneous cluster can be effectively perceived, supporting the management and scheduling of cluster GPU resources; the improvements covering the resource representation model, the resource scheduling model, the resource preemption model, the resource isolation mechanism and the GPU device binding mechanism;
(2) the improved Spark distributed computing framework including:
improvements to Spark's resource application and allocation mechanism, job scheduling mechanism and task execution mechanism so that it supports scheduling and execution of GPU-type tasks; by introducing GPU-resource tags at stages such as resource application, resource allocation, DAG generation, stage division and task execution, its execution engine can perceive GPU tasks and execute them effectively in a heterogeneous cluster.
2. The parallel computing system based on Spark and GPU according to claim 1, wherein the improved resource management platform supports the management and scheduling of multi-dimensional resources including GPU resources:
Regarding the resource representation model: the number of GPU devices contained in a node is first defined by the user, and the resource representation protocol is modified to add a representation of GPU resources; when a node starts, the node manager initializes its resource list and reports the node's resource information to the resource manager through the heartbeat mechanism;
Regarding the resource scheduling model: GPU resources are added, together with CPU and memory resources, into the hierarchical management queues of the resource management platform; the resource scheduling module is modified according to the DRF (Dominant Resource Fairness) algorithm so that it also schedules and manages GPU resources; the algorithm is as follows:
(1) Initialize the variables, where R = &lt;totalCPU, totalGPU, totalMem&gt; denotes the total CPU, GPU and memory resources of the cluster, C = &lt;usedCPU, usedGPU, usedMem&gt; denotes the CPU, GPU and memory resources already consumed in the cluster, si denotes the share of job i's dominant resource relative to the corresponding total resource, Ui = &lt;CPUi, GPUi, Memi&gt; denotes the resources already allocated to job i, and Di = &lt;CPUi, GPUi, Memi&gt; denotes the resources required by each task of job i; each time a job is chosen for resource allocation, the following steps are performed in turn:
(2) Choose the job with the smallest dominant resource share si;
(3) If C + Di ≤ R, allocate the resources to job i and update C = C + Di, Ui = Ui + Di, si = max{Ui/R} (the largest fraction, over the three resource types, of the total that job i now holds); otherwise, the cluster resources cannot meet the demand and allocation stops;
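The allocation loop above can be sketched as follows; this is an illustrative Python rendering of the DRF steps, where the cluster capacities, job names and per-task demands are invented for the example rather than taken from the claim:

```python
# Sketch of the claimed DRF-style loop extended with GPU as a third dimension.
def drf_allocate(total, jobs, demands, rounds):
    used = {r: 0 for r in total}                       # C: consumed cluster resources
    alloc = {j: {r: 0 for r in total} for j in jobs}   # U_i: resources held by job i
    share = {j: 0.0 for j in jobs}                     # s_i: dominant resource share
    for _ in range(rounds):
        # (2) pick the job with the smallest dominant share s_i
        job = min(jobs, key=lambda j: share[j])
        d = demands[job]
        # (3) allocate one task's demand only if C + D_i <= R, else stop
        if any(used[r] + d[r] > total[r] for r in total):
            break
        for r in total:
            used[r] += d[r]
            alloc[job][r] += d[r]
        share[job] = max(alloc[job][r] / total[r] for r in total)
    return alloc

total = {"cpu": 9, "gpu": 4, "mem": 18}
demands = {"A": {"cpu": 1, "gpu": 1, "mem": 2},   # GPU-heavy job
           "B": {"cpu": 3, "gpu": 0, "mem": 1}}   # CPU-heavy job
result = drf_allocate(total, ["A", "B"], demands, rounds=10)
print(result)
```

With these toy numbers the loop ends with the GPU-heavy job holding 3 of 4 GPUs and the CPU-heavy job holding 6 of 9 cores, i.e. each job's dominant share is equalized as far as capacity allows.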
Regarding the resource preemption model: the resource scheduler sets an upper and a lower bound on the usage of each kind of resource for every queue in the hierarchical queues; the resource scheduler lends the resource quota of lightly loaded queues to more heavily loaded queues in order to improve cluster resource utilization; but when a new application is submitted to a lightly loaded queue, the scheduler reclaims, through the resource preemption mechanism, the resources occupied by other queues, so that the resources originally belonging to that queue can be allocated back to it; when resource preemption occurs, GPU resources must be released; this work is completed by the node manager, and a new releaseGPU method is added here to release GPU resources; the resource manager sends the list of resources to be released to the corresponding node manager through the heartbeat mechanism, and when the node manager detects that a resource entity to be released contains GPU resources, it calls the releaseGPU method to release them; the resource manager then further allocates the released resources to the relevant queues;
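The release path just described can be sketched as follows; this is a simplified stand-in in which GPU bindings live in a plain dictionary and entity identifiers are invented, whereas the real mechanism operates on YARN resource entities and heartbeat messages:

```python
# Sketch of the releaseGPU path: the resource manager sends a release list via
# heartbeat, and the node manager frees devices only for entities holding a GPU.
class NodeManager:
    def __init__(self):
        self.gpu_binding = {}  # resource entity id -> GPU device number

    def release_gpu(self, entity_id):
        # newly added releaseGPU method: unbind this entity's GPU device
        return self.gpu_binding.pop(entity_id, None)

    def on_heartbeat_release(self, release_list):
        freed = []
        for entity in release_list:
            if entity["id"] in self.gpu_binding:   # entity contains GPU resource
                freed.append(self.release_gpu(entity["id"]))
        return freed

nm = NodeManager()
nm.gpu_binding = {"c1": 0, "c3": 1}
freed = nm.on_heartbeat_release([{"id": "c1"}, {"id": "c2"}, {"id": "c3"}])
print(freed)  # the freed devices go back to the resource manager's queues
```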
Regarding the resource isolation model: GPU resources are isolated using the Cgroups scheme;
Regarding the GPU device binding mechanism: when the resource entity allocated to a task contains GPU resources, the corresponding node manager needs to bind a GPU device on the node to that resource entity; if there are multiple idle GPU resources on the node, one of them must be selected for allocation; the running state of the GPUs is expressed as a list of &lt;GPU device number, resource entity number&gt; pairs, in which each entry identifies the correspondence between a GPU device and its resource entity; when the node starts, the node manager initializes this list according to the relevant configuration file and the GPU device information on the node;
When a new task requests GPU resources, the node manager searches this list to obtain the GPU devices that are in the idle state and assigns them to the task; if multiple GPU resources on the node manager's node are idle, GPU resources are allocated by round robin; meanwhile, the correspondence between running resource entities and GPU resources is saved into a database; in case the node manager needs to restart, the GPU device allocation information can be read directly from the database, avoiding the reallocation of node resources.
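The binding list and round-robin selection can be sketched as follows; the device numbering, task names and the in-memory dictionary standing in for the persistence database are assumptions of the example:

```python
# Sketch of the <GPU device number, resource entity number> binding list with
# round-robin selection among idle devices.
class GPUBinder:
    def __init__(self, num_devices):
        # binding list: device number -> resource entity (None means idle)
        self.binding = {dev: None for dev in range(num_devices)}
        self.next_dev = 0  # round-robin cursor

    def bind(self, entity_id):
        # scan devices round-robin, starting after the last one handed out
        n = len(self.binding)
        for i in range(n):
            dev = (self.next_dev + i) % n
            if self.binding[dev] is None:         # idle device found
                self.binding[dev] = entity_id
                self.next_dev = (dev + 1) % n
                return dev
        return None                               # no idle GPU: caller must wait

    def unbind(self, entity_id):
        for dev, ent in self.binding.items():
            if ent == entity_id:
                self.binding[dev] = None

binder = GPUBinder(num_devices=2)
a = binder.bind("task-a")   # gets device 0
b = binder.bind("task-b")   # gets device 1
binder.unbind("task-a")
c = binder.bind("task-c")   # cursor wraps and reuses the freed device 0
print(a, b, c)
```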
3. The parallel computing system based on Spark and GPU according to claim 2, wherein the improved Spark distributed computing framework improves the Spark kernel so that it supports the scheduling and execution of GPU-type tasks:
When a job is submitted, if the application controller of the Spark application detects that the application needs GPU resources, the required GPU resources are added to the resource request description during resource application;
Two kinds of Containers are applied for: CPU-type Containers and GPU-type Containers; because a GPU-type task also needs the CPU to complete data processing, data transfer and GPU launch, a GPU-type Container needs, in addition to one unit of GPU resource, a specified number of CPU cores; when applying for resources, the numbers of the two kinds of Containers to apply for are determined; here, executorCores denotes the number of CPU cores each Container contains, totalCores denotes the total number of CPU cores of the application, and GPUNum denotes the number of GPU resources of the application; the number of GPU-type Containers is then GPUNum, and the number of non-GPU-type Containers is (totalCores-GPUNum*executorCores)/executorCores; next, based on the configured amount of memory resources, it is checked whether the total memory can satisfy the memory required by all Containers, for further processing; after the resource request is sent, the resource scheduler does not immediately return resources that satisfy the request; the corresponding application controller of Spark must continuously communicate with the resource manager through the heartbeat mechanism to probe whether the requested resources have been allocated; after receiving the requested resources, the application controller adds them to the program's internal list of resources to be allocated, from which they are assigned to the tasks that actually execute;
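The container arithmetic above can be captured in a small helper; the parameter names mirror executorCores, totalCores and GPUNum from the claim, while the per-container memory figure is a simplified stand-in for the memory validation described:

```python
# Sketch of the two-kind container plan: GPU-type containers (one per GPU) and
# CPU-type containers filling the remaining cores, plus a total-memory check.
def plan_containers(total_cores, gpu_num, executor_cores,
                    mem_per_container, total_mem):
    gpu_containers = gpu_num
    cpu_containers = (total_cores - gpu_num * executor_cores) // executor_cores
    n = gpu_containers + cpu_containers
    if n * mem_per_container > total_mem:
        raise ValueError("total memory cannot satisfy all containers")
    return gpu_containers, cpu_containers

# e.g. 16 cores, 2 GPUs, 2 cores per executor -> 2 GPU-type + 6 CPU-type containers
plan = plan_containers(16, 2, 2, mem_per_container=4, total_mem=64)
print(plan)
```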
In the Spark interface, GPU tasks are identified through the mapPartitionsGPU operator and MapPartitionsGPURDD, which are used to process GPU tasks;
After generating the DAG graph, the Spark job scheduler DAGScheduler adds, when it begins to divide stages, a field identifying whether the current stage contains GPU operations; inside a stage, the internal RDDs are divided into two kinds according to whether the computation running on each RDD needs GPU resources: RDDs that need GPU resources and RDDs that do not; if a stage contains an RDD that needs GPU resources, then sufficient GPU resources are allocated when resources are allocated for the partitions of the RDDs in this stage, even if possibly only one RDD needs them during computation; otherwise, a task might fail during computation because no usable GPU resource is available; in order to identify whether a stage contains RDDs that need GPU resources, a field flagGPU is added to the stage; when flagGPU is true, the stage contains RDDs that need GPU resources; by setting the flagGPU field, the task manager can recognize such stages in the subsequent resource allocation step and allocate GPU resources for them.
4. The parallel computing system based on Spark and GPU according to claim 3, wherein the flow for identifying stage types inside the Spark job scheduler DAGScheduler is as follows:
(1) After the DAG is generated, stages are divided; when generating a stage, it is detected whether the flagGPU field of any RDD contained inside the stage is true; if so, the stage needs GPU resources during execution, and the flagGPU field of the stage is marked true; this later serves as the basis on which the task manager allocates GPU resources;
(2) The algorithm by which the execution engine submits stages is a recursive process: the last stage in the DAG graph is submitted first, and it is then checked whether all parent stages of that stage have been submitted; if they have, the task set corresponding to this stage starts to execute; if some parent stage has not been submitted, the parent stage is submitted recursively and the same check is made; the final result is that stages are executed from front to back according to the DAG graph;
(3) After a stage is submitted, the task manager starts dividing the stage into a task set and applies to the cluster manager for the required resources; the number of tasks contained in the task set equals the number of RDD partitions; the task manager first detects whether the flagGPU field of the stage is true, and if so, allocates containers containing GPU resources for it; during container allocation, if multiple containers are available for selection, the choice is made according to the locality strategy, i.e., the local node, other nodes in the same rack, and nodes in other racks are selected in turn; tasks are then started on the nodes holding the resources, and the intermediate and final task results are stored in the storage system; during this process, if the number of containers containing GPU resources is smaller than the number of GPU-type tasks, the tasks not yet allocated GPU resources must wait temporarily; when other tasks finish and GPU resources become idle, they are then allocated;
(4) After a task finishes, its resources are returned; the reclaimed containers are added to the to-be-allocated list for use by other tasks.
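Steps (1)-(2) of the flow above can be sketched as follows, with toy Stage objects standing in for DAGScheduler internals; the stage names and per-RDD GPU flags are invented for the example:

```python
# Sketch of flagGPU stage marking and recursive parents-first stage submission.
class Stage:
    def __init__(self, name, rdd_needs_gpu, parents=()):
        self.name = name
        self.parents = list(parents)
        # (1) a stage needs GPU resources if any RDD inside it does
        self.flag_gpu = any(rdd_needs_gpu)

def submit(stage, order):
    """(2) recursive submission: all parent stages first, then this stage."""
    for parent in stage.parents:
        if parent.name not in order:
            submit(parent, order)
    if stage.name not in order:
        order.append(stage.name)

s1 = Stage("shuffle-map", rdd_needs_gpu=[False, False])
s2 = Stage("gpu-map", rdd_needs_gpu=[False, True], parents=[s1])
s3 = Stage("result", rdd_needs_gpu=[False], parents=[s2])

order = []
submit(s3, order)                  # submit the last stage in the DAG
gpu_stages = [s.name for s in (s1, s2, s3) if s.flag_gpu]
print(order, gpu_stages)
```

The task manager would then request GPU-bearing containers only for the stages in `gpu_stages`, exactly as step (3) describes.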
5. The parallel computing system based on Spark and GPU according to claim 4, wherein an effective programming model for GPU-type tasks is proposed:
In Spark, the data in an RDD consists of several partitions, which are finally distributed, in units of partitions, to several nodes where the computation is completed; according to the execution granularity on partition data, GPU computation with Spark falls into two main types:
(1) GPU computation completed in units of a partition, i.e., all the data in an RDD partition is put into the GPU at once to complete parallel computation, so as to improve execution efficiency;
(2) GPU computation completed in units of a single record, i.e., the data in an RDD partition is put into the GPU record by record to complete the computation, accelerating processing at the granularity of single records.
6. The parallel computing system based on Spark and GPU according to claim 4, wherein the newly added mapPartitionsGPU operator can perceive GPU-type tasks and processes partition data as input; the main execution logic of the operator is as follows:
(1) The GPU device is first initialized in the method;
(2) It is then judged whether the execution granularity on the partition data is a whole partition or a single record; if the unit is a partition, the partition data is transferred into GPU device memory using the CUDA API, a process that may involve data format conversion, converting the partition data of the RDD into a data format the GPU can process; the GPU is then called to compute on the data in parallel, and after the computation completes, the output result is transferred back to main memory; if the execution granularity is a single record, the records of the partition are processed sequentially one by one: each time, one record's data is copied into GPU device memory, the GPU is called to compute on it in parallel, and after the computation completes, the output result is copied back to main memory; after all records have been processed, the output results of all records are assembled into a partition set;
(3) The GPU device is released, and a partition-set iterator is returned.
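The operator's control flow can be mocked in Python as follows; the "GPU" here is simulated by a plain function, since the claim leaves the actual CUDA invocation layer unspecified, and the kernel and input data are invented for the example:

```python
# Mock of the mapPartitionsGPU execution logic: (1) init the device, (2) run
# at partition or record granularity, (3) release the device and return an
# iterator over the resulting partition.
def map_partitions_gpu(partition, kernel, per_record=False):
    device = {"initialized": True}          # (1) init GPU device (simulated)
    try:
        if not per_record:
            # (2a) partition granularity: ship the whole partition at once
            data = list(partition)          # format conversion / copy-in
            results = kernel(data)          # parallel computation on the "GPU"
        else:
            # (2b) record granularity: copy and compute one record at a time
            results = []
            for record in partition:
                results.extend(kernel([record]))
        return iter(results)                # (3) return the partition iterator
    finally:
        device["initialized"] = False       # (3) release the GPU device

square = lambda xs: [x * x for x in xs]
out_part = list(map_partitions_gpu([1, 2, 3], square))
out_rec = list(map_partitions_gpu([1, 2, 3], square, per_record=True))
print(out_part, out_rec)
```

Both granularities produce the same result; the partition-granularity path simply amortizes the copy-in/copy-out cost over the whole partition, which is why the claim prefers it for efficiency.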
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710270400.8A CN107168782A (en) | 2017-04-24 | 2017-04-24 | A kind of concurrent computational system based on Spark and GPU |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107168782A true CN107168782A (en) | 2017-09-15 |
Family
ID=59813923
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107168782A (en) |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108596824A (en) * | 2018-03-21 | 2018-09-28 | 华中科技大学 | A kind of method and system optimizing rich metadata management based on GPU |
CN108652610A (en) * | 2018-06-04 | 2018-10-16 | 成都皓图智能科技有限责任公司 | A kind of non-contact detection method that more popular feelings are jumped |
CN108762921A (en) * | 2018-05-18 | 2018-11-06 | 电子科技大学 | A kind of method for scheduling task and device of the on-line optimization subregion of Spark group systems |
CN109032809A (en) * | 2018-08-13 | 2018-12-18 | 华东计算技术研究所(中国电子科技集团公司第三十二研究所) | Heterogeneous parallel scheduling system based on remote sensing image storage position |
CN109086137A (en) * | 2018-08-06 | 2018-12-25 | 清华四川能源互联网研究院 | GPU concurrent computation resource configuration method and device |
CN109254851A (en) * | 2018-09-30 | 2019-01-22 | 武汉斗鱼网络科技有限公司 | A kind of method and relevant apparatus for dispatching GPU |
CN109743453A (en) * | 2018-12-29 | 2019-05-10 | 出门问问信息科技有限公司 | A kind of multi-screen display method and device |
CN109977306A (en) * | 2019-03-14 | 2019-07-05 | 北京达佳互联信息技术有限公司 | Implementation method, system, server and the medium of advertisement engine |
CN109995965A (en) * | 2019-04-08 | 2019-07-09 | 复旦大学 | A kind of ultrahigh resolution video image real-time calibration method based on FPGA |
CN110018817A (en) * | 2018-01-05 | 2019-07-16 | 中兴通讯股份有限公司 | The distributed operation method and device of data, storage medium and processor |
CN110109747A (en) * | 2019-05-21 | 2019-08-09 | 北京百度网讯科技有限公司 | Method for interchanging data and system, server based on Apache Spark |
CN110134521A (en) * | 2019-05-28 | 2019-08-16 | 北京达佳互联信息技术有限公司 | Method, apparatus, resource manager and the storage medium of resource allocation |
CN110351384A (en) * | 2019-07-19 | 2019-10-18 | 深圳前海微众银行股份有限公司 | Big data platform method for managing resource, device, equipment and readable storage medium storing program for executing |
CN110442446A (en) * | 2019-06-29 | 2019-11-12 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | The method of processing high-speed digital signal data flow in real time |
CN110458294A (en) * | 2019-08-19 | 2019-11-15 | Oppo广东移动通信有限公司 | Model running method, apparatus, terminal and storage medium |
CN110704186A (en) * | 2019-09-25 | 2020-01-17 | 国家计算机网络与信息安全管理中心 | Computing resource allocation method and device based on hybrid distribution architecture and storage medium |
CN110795219A (en) * | 2019-10-24 | 2020-02-14 | 华东计算技术研究所(中国电子科技集团公司第三十二研究所) | Resource scheduling method and system suitable for multiple computing frameworks |
CN110879753A (en) * | 2019-11-19 | 2020-03-13 | 中国移动通信集团广东有限公司 | GPU acceleration performance optimization method and system based on automatic cluster resource management |
CN110955526A (en) * | 2019-12-16 | 2020-04-03 | 湖南大学 | Method and system for realizing multi-GPU scheduling in distributed heterogeneous environment |
CN111240844A (en) * | 2020-01-13 | 2020-06-05 | 星环信息科技(上海)有限公司 | Resource scheduling method, equipment and storage medium |
CN111314401A (en) * | 2018-12-12 | 2020-06-19 | 百度在线网络技术(北京)有限公司 | Resource allocation method, device, system, terminal and computer readable storage medium |
CN111400035A (en) * | 2020-03-04 | 2020-07-10 | 杭州海康威视系统技术有限公司 | Video memory allocation method and device, electronic equipment and storage medium |
CN111656323A (en) * | 2018-01-23 | 2020-09-11 | 派泰克集群能力中心有限公司 | Dynamic allocation of heterogeneous computing resources determined at application runtime |
CN112035261A (en) * | 2020-09-11 | 2020-12-04 | 杭州海康威视数字技术股份有限公司 | Data processing method and system |
CN112711448A (en) * | 2020-12-30 | 2021-04-27 | 安阳师范学院 | Agent technology-based parallel component assembling and performance optimizing method |
CN112835996A (en) * | 2019-11-22 | 2021-05-25 | 北京初速度科技有限公司 | Map production system and method thereof |
CN113515361A (en) * | 2021-07-08 | 2021-10-19 | 中国电子科技集团公司第五十二研究所 | Lightweight heterogeneous computing cluster system facing service |
CN113808001A (en) * | 2021-11-19 | 2021-12-17 | 南京芯驰半导体科技有限公司 | Method and system for single system to simultaneously support multiple GPU (graphics processing Unit) work |
CN114840125B (en) * | 2022-03-30 | 2024-04-26 | 曙光信息产业(北京)有限公司 | Device resource allocation and management method, device resource allocation and management device, device resource allocation and management medium, and program product |
CN112035261B (en) * | 2020-09-11 | 2024-10-01 | 杭州海康威视数字技术股份有限公司 | Data processing method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104407921A (en) * | 2014-12-25 | 2015-03-11 | 浪潮电子信息产业股份有限公司 | Time-based method for dynamically scheduling YARN task resources |
CN105022670A (en) * | 2015-07-17 | 2015-11-04 | 中国海洋大学 | Heterogeneous distributed task processing system and processing method in cloud computing platform |
EP3067797A1 (en) * | 2015-03-12 | 2016-09-14 | International Business Machines Corporation | Creating new cloud resource instruction set architecture |
CN106506266A (en) * | 2016-11-01 | 2017-03-15 | 中国人民解放军91655部队 | Network flow analysis method based on GPU, Hadoop/Spark mixing Computational frame |
Non-Patent Citations (2)
Title |
---|
刘德波: "基于YARN的GPU集群系统研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
郑伟: "Spark下MPI/GPU并行计算处理机制的研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | |
Application publication date: 20170915 |