CN108228351A - Performance-balanced scheduling method for a GPU, storage medium, and electronic terminal - Google Patents
Performance-balanced scheduling method for a GPU, storage medium, and electronic terminal
- Publication number: CN108228351A (application CN201711460215.1A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F9/5038 — Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering the execution order of a plurality of tasks, e.g. taking priority or time-dependency constraints into consideration
- G06F9/4881 — Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G06F9/5011 — Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
- G06F9/5016 — Allocation of resources to service a request, the resource being the memory
Abstract
The present invention provides a performance-balanced scheduling method for a GPU, a storage medium, and an electronic terminal. The method includes: collecting the per-level cache runtime statistics of each sharing application and the current streaming-multiprocessor (SM) cluster allocation plan; extracting, with a trained runtime pressure extractor, the pressure each sharing application bears on the L2 cache and on memory bandwidth; with the collected runtime statistics and the extracted pressures as input, predicting and outputting each sharing application's conflict performance degradation with a trained conflict-degradation predictor and its expansion performance degradation with a trained expansion-degradation predictor; and, from the predicted conflict and expansion performance degradations, obtaining the performance-imbalance degree of the GPU and determining, according to that degree, a new SM allocation plan that redistributes the SM clusters. The invention ensures that the degree of performance degradation is balanced among the sharing applications.
Description
Technical field
The present invention relates to the field of processor technology, and in particular to GPU technology; specifically, it concerns a performance-balanced scheduling method for a GPU, a storage medium, and an electronic terminal.
Background technology
With the large-scale deployment of compute-intensive applications such as speech recognition, machine translation, and personal digital assistants, mainstream private data centers and public cloud platforms have begun to adopt coprocessors such as GPUs on a large scale to cope with the insufficient computing power of traditional CPUs. The GPU was originally a special-purpose processor designed for graphics computation, but because it offers parallelism that traditional CPUs cannot match, more and more non-graphics applications are migrating to the GPU platform to satisfy their rapidly growing computational demands. Research shows, however, that non-graphics applications often lack sufficient parallelism to make full use of the GPU's hardware resources, which leads to waste. Moreover, as GPU architecture and process technology advance, more and more streaming multiprocessors (Streaming Multiprocessor, SM) are integrated into a single GPU, making the resource-waste problem even more pronounced.
To address this, spatial multitasking — multiple applications running on and sharing one GPU simultaneously — has been proposed. Related research shows that spatial multitasking can greatly improve GPU resource utilization and overall system performance. When multiple applications share a GPU, they compete with one another for (1) the streaming multiprocessors, (2) the shared L2 cache, and (3) global memory bandwidth. Consequently, each application's performance when sharing the GPU degrades relative to when it monopolizes the whole GPU. At the same time, because different applications often scale differently with SM count and differ in their sensitivity to contention on the shared L2 cache and global memory bandwidth, the conventional scheme of simply partitioning the SMs among the applications cannot guarantee performance fairness among the applications sharing the GPU; that is, each application suffers a different degree of performance degradation.
For a multi-tenant cloud platform, guaranteeing performance fairness among the sharing applications is of great importance. If fairness cannot be guaranteed, then, following game-theoretic reasoning, platform users will tend to resist sharing a GPU with other users, greatly reducing the opportunity to use spatial multitasking to raise GPU resource utilization and overall system performance, and putting the platform at a disadvantage against competing platforms. Therefore, on the basis of using spatial multitasking to improve resource utilization and overall system performance, it is also important to guarantee performance fairness among the sharing applications.
Summary of the invention
In view of the above deficiencies of the prior art, the purpose of the present invention is to provide a performance-balanced scheduling method for a GPU, a storage medium, and an electronic terminal that, through accurate prediction of performance degradation and dynamic SM allocation scheduling, guarantee performance fairness among the sharing applications while still using spatial multitasking to improve resource utilization and overall system performance.
To achieve the above and other related purposes, the present invention provides a performance-balanced scheduling method for a GPU, the method comprising: collecting the per-level cache runtime statistics of each sharing application and the current SM cluster allocation plan; extracting, with a trained runtime pressure extractor, the pressure each sharing application bears on the L2 cache and on memory bandwidth; with the collected runtime statistics of the sharing applications and the pressures they bear on the various shared resources as input, predicting and outputting each sharing application's conflict performance degradation with a trained conflict-degradation predictor and its expansion performance degradation with a trained expansion-degradation predictor; and, from the predicted conflict and expansion performance degradations, obtaining the performance-imbalance degree of the GPU and determining, according to that degree, a new SM allocation plan that redistributes the SM clusters.
In one embodiment of the invention, the training process of the runtime pressure extractor includes: separately designing multiple pressure-measurement programs for the L2 cache and for memory bandwidth; separately designing multiple pressure generators for the L2 cache and for memory bandwidth; having the pressure-measurement programs and pressure generators share the GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the L2 cache and on memory bandwidth; and training a preset neural network with the collected runtime statistics as input and the measured pressure values as output, thereby forming the runtime pressure extractor.
In one embodiment of the invention, the training process of the conflict-degradation predictor and of the expansion-degradation predictor includes: choosing multiple application programs; having the application programs and multiple pressure generators share the GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the L2 cache and on memory bandwidth, each application program's conflict performance degradation, and each application program's expansion performance degradation; training a preset neural network with the collected runtime statistics and measured pressure values as input and the conflict performance degradation as output, forming the conflict-degradation predictor; and training a preset neural network with the collected runtime statistics and measured pressure values as input and the expansion performance degradation as output, forming the expansion-degradation predictor.
In one embodiment of the invention, training the preset neural network with the collected runtime statistics as input and the measured pressure values as output to form the runtime pressure extractor specifically includes: training a preset neural network with the collected runtime statistics as input and the measured L2-cache pressure values as output, forming the L2-cache pressure extractor; and training a preset neural network with the collected runtime statistics as input and the measured memory-bandwidth pressure values as output, forming the memory-bandwidth pressure extractor.
In one embodiment of the invention, the neural network used by the L2-cache pressure extractor and the memory-bandwidth pressure extractor comprises one input layer, two hidden layers, and one output layer; the number of neurons in each hidden layer equals the number of inputs, and the network uses the LeakyReLU activation function.
In one embodiment of the invention, the conflict performance degradation is, with the number of SM clusters held fixed, the degradation of an application program's performance under contention on the L2 cache and memory bandwidth relative to its performance without contention; the expansion performance degradation is, in the complete absence of contention on the shared cache and memory bandwidth, the degradation of an application program's performance when using a given number of SM clusters relative to its performance when it monopolizes the whole GPU.
In one embodiment of the invention, obtaining the performance-imbalance degree of the GPU specifically includes: obtaining each sharing application's actual performance degradation from its predicted conflict and expansion performance degradations, and obtaining the performance-imbalance degree of the GPU from the actual performance degradations; the actual performance degradation equals the product of the corresponding conflict performance degradation and expansion performance degradation.
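The degradation arithmetic in this embodiment can be sketched in a few lines. This is an illustration, not the patent's implementation: degradations are represented here as normalized-performance ratios (achieved performance divided by baseline performance), under which the product rule composes cleanly, and the imbalance degree is assumed to be the spread (max minus min) of the actual degradations, since the patent does not fix a formula.

```python
def actual_degradation(conflict_deg: float, expansion_deg: float) -> float:
    """Actual performance degradation of one application: the product of its
    conflict degradation and its expansion degradation (both expressed here as
    normalized-performance ratios, where 1.0 means no degradation)."""
    return conflict_deg * expansion_deg

def imbalance_degree(conflict_degs, expansion_degs) -> float:
    """Performance-imbalance degree of the GPU, assumed here to be the spread
    between the most- and least-degraded sharing applications."""
    actual = [actual_degradation(c, e) for c, e in zip(conflict_degs, expansion_degs)]
    return max(actual) - min(actual)

# App 0 keeps 90% of its performance under contention and 80% with its SM share;
# App 1 keeps 70% and 60% respectively.
print(actual_degradation(0.9, 0.8))              # ≈ 0.72 of exclusive-GPU performance
print(imbalance_degree([0.9, 0.7], [0.8, 0.6]))  # ≈ 0.72 - 0.42 = 0.30
```

Under this ratio convention, the product of the two predicted degradations is exactly the application's performance relative to monopolizing the whole GPU, which is why the embodiment can multiply them.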
In one embodiment of the invention, determining, according to the imbalance degree, the new SM allocation plan that redistributes the SM clusters specifically includes: if the imbalance degree exceeds a set threshold, performing reallocation; during reallocation, a preset algorithm repeatedly reallocates one SM cluster between the application whose current performance degradation is smallest and the application whose degradation is largest, gradually reducing the imbalance degree; when the distance between the new allocation plan and the initial allocation plan exceeds a specific threshold, the current new allocation plan is adopted as the new SM allocation plan.
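The reallocation rule just described can be sketched as follows. This is a minimal illustration under assumptions the patent leaves open: degradations are expressed so that a higher value means a worse-affected application, the plan distance is taken as the number of SM clusters moved, and the re-prediction of degradations after each move (which the real system would perform) is omitted.

```python
def rebalance_sms(initial_plan, degradations, imbalance_threshold, distance_threshold):
    """One rebalancing pass: if the imbalance degree exceeds the set threshold,
    repeatedly move one SM cluster from the least-degraded application to the
    most-degraded one; once the new plan's distance from the initial plan
    exceeds the specific threshold, adopt the current plan.

    initial_plan: dict app -> number of SM clusters currently assigned
    degradations: dict app -> predicted actual degradation (higher = worse)"""
    plan = dict(initial_plan)
    imbalance = max(degradations.values()) - min(degradations.values())
    if imbalance <= imbalance_threshold:
        return plan                      # already balanced enough
    while True:
        worst = max(degradations, key=degradations.get)  # most degraded: gains an SM
        best = min(degradations, key=degradations.get)   # least degraded: cedes an SM
        if plan[best] <= 1:              # never strip an app of its last SM cluster
            break
        plan[best] -= 1
        plan[worst] += 1
        moved = sum(abs(plan[a] - initial_plan[a]) for a in plan) // 2
        if moved > distance_threshold:   # plan has drifted far enough: stop
            break
    return plan

# Two applications share 16 SM clusters; App-a is far more degraded than App-b.
print(rebalance_sms({"App-a": 8, "App-b": 8}, {"App-a": 0.5, "App-b": 0.1},
                    imbalance_threshold=0.2, distance_threshold=2))
# → {'App-a': 11, 'App-b': 5}
```

Bounding the per-pass distance from the initial plan keeps each scheduling step cheap and lets the next pass re-predict degradations before moving further.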
An embodiment of the present invention further provides a storage medium storing program instructions which, when run by a GPU processor, implement the method described above.
An embodiment of the present invention further provides an electronic terminal comprising a GPU processor and a memory, the memory storing program instructions; when the GPU processor runs the program instructions, the method described above is implemented.
As described above, the performance-balanced scheduling method, storage medium, and electronic terminal for a GPU of the present invention have the following advantageous effects: the present invention provides a performance-balanced scheduling mechanism for preemption-based multitask GPU sharing. Without requiring additional hardware support, the mechanism ensures the balance of performance-degradation degrees among the sharing applications while preserving the gains in GPU resource utilization and overall system performance obtained through spatial multitasking. The results of the invention can be used directly in multi-tenant public-cloud environments to guarantee performance fairness among the sharing users.
Description of the drawings
Fig. 1 is a flow diagram of the performance-balanced scheduling method for a GPU according to the present invention.
Fig. 2 is an architecture diagram for the performance-balanced scheduling method for a GPU according to the present invention.
Fig. 3 is a structural diagram of the system to which the performance-balanced scheduling method for a GPU according to the present invention is applied.
Fig. 4 is a flow diagram of the offline training stage of the performance-balanced scheduling method for a GPU according to the present invention.
Fig. 5 is a flow diagram of the online scheduling stage of the performance-balanced scheduling method for a GPU according to the present invention.
Specific embodiment
The embodiments of the present invention are illustrated below by way of specific examples; those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or varied from different viewpoints and applications without departing from the spirit of the invention.
The purpose of this embodiment is to provide a performance-balanced scheduling method for a GPU, a storage medium, and an electronic terminal that, through accurate prediction of performance degradation and dynamic SM allocation scheduling, guarantee performance fairness among the sharing applications while using spatial multitasking to improve resource utilization and overall system performance.

The principle and implementation of the performance-balanced scheduling method, storage medium, and electronic terminal for a GPU of the present invention are described in detail below, so that those skilled in the art can understand them without creative work.

Specifically, this embodiment aims at a low-overhead performance-balanced scheduling mechanism for preemption-based multitask GPU sharing that guarantees performance fairness among the sharing applications on the basis of improved resource utilization and overall system performance.
The performance-balanced scheduling method, storage medium, and electronic terminal for a GPU of this embodiment are described in detail below.

As shown in Fig. 1, this embodiment provides a performance-balanced scheduling method for a GPU, which includes the following steps:
Step S110: collect the per-level cache runtime statistics of each sharing application and the current SM cluster allocation plan;

Step S120: extract, with the trained runtime pressure extractor, the pressure each sharing application bears on the L2 cache and on memory bandwidth;

Step S130: with the collected runtime statistics of the sharing applications and the pressures they bear on the various shared resources as input, predict and output each sharing application's conflict performance degradation with the trained conflict-degradation predictor, and its expansion performance degradation with the trained expansion-degradation predictor;

Step S140: from the predicted conflict and expansion performance degradations of the sharing applications, obtain the performance-imbalance degree of the GPU and, according to that degree, determine a new SM allocation plan that redistributes the SM clusters.
The performance-balanced scheduling method for a GPU of this embodiment is described in detail below.

The underlying hardware to which the method of this embodiment applies is a multitask GPU that supports SM-level preemption. As shown in Fig. 2, a GPU typically consists of several SMs (streaming multiprocessors, also called SM clusters or GPU "big cores"); each SM has its own private L1 cache, and all SMs share one L2 cache. When two applications App-a and App-b run on a multitask GPU that supports preemption, they compete with each other for SMs, the shared L2 cache, and global memory bandwidth.

Fig. 3 shows the software architecture of the preemption-based multitask-GPU performance-balanced scheduling mechanism of the present invention. The software architecture of the runtime system is divided into four layers: the runtime-information extraction layer, the pressure extraction layer, the performance prediction layer, and the allocation scheduling layer.
Step S110: collect the per-level cache runtime statistics of each sharing application and the current SM cluster allocation plan.

The runtime-information extraction layer collects the per-level cache runtime statistics of each sharing application and the current SM cluster allocation plan. It extracts the per-level cache information provided by the statistics modules on the GPU chip together with the current SM allocation plan. This information reflects the characteristics of each sharing application, and all subsequent processing depends entirely on it.
Step S120: extract, with the trained runtime pressure extractor, the pressure each sharing application bears on the L2 cache and on memory bandwidth.

Pressure reflects the severity of contention on a shared resource.
In this embodiment, as shown in Fig. 4, the training process of the runtime pressure extractor includes: separately designing multiple pressure-measurement programs for the L2 cache and for memory bandwidth; separately designing multiple pressure generators for the L2 cache and for memory bandwidth; having the pressure-measurement programs and pressure generators share the GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the L2 cache and on memory bandwidth; and training a preset neural network with the collected runtime statistics as input and the measured pressure values as output, forming the runtime pressure extractor.
To quantify pressure, each shared resource (the shared L2 cache or memory bandwidth) has a dedicated pressure-measurement program. When an application shares the GPU with a pressure-measurement program, the pressure the application generates on the corresponding shared resource is defined as the performance degradation of that pressure-measurement program (relative to its performance when it monopolizes the whole GPU). The pressure an application bears on a shared resource is defined as the pressure the other applications generate on that resource.
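As a concrete reading of these definitions (an illustration, not the patent's code), the pressure an application generates can be computed from the probe program's measured performance, and the pressure it bears from the pressures its co-runners generate; treating the multi-application case as a sum is an assumption:

```python
def pressure_generated(probe_alone_perf: float, probe_shared_perf: float) -> float:
    """Pressure an application generates on a shared resource: the performance
    degradation of that resource's dedicated pressure-measurement (probe)
    program when co-running with the application, relative to the probe
    monopolizing the whole GPU (expressed as the fraction of performance lost)."""
    return 1.0 - probe_shared_perf / probe_alone_perf

def pressure_borne(pressures_of_co_runners) -> float:
    """Pressure an application bears on a resource: the pressure generated on
    that resource by the other sharing applications (summing the individual
    contributions is an assumption for the multi-application case)."""
    return sum(pressures_of_co_runners)

# The L2-cache probe's throughput drops from 100 to 60 units when co-run with
# an application: that application generates a pressure of 0.4 on the L2 cache.
print(pressure_generated(100.0, 60.0))   # 0.4
```

These measured pressure values are what serve as training labels for the pressure extractors described next.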
In this embodiment, training the preset neural network with the collected runtime statistics as input and the measured pressure values as output to form the runtime pressure extractor specifically includes: training a preset neural network with the collected runtime statistics as input and the measured L2-cache pressure values as output, forming the L2-cache pressure extractor; and training a preset neural network with the collected runtime statistics as input and the measured memory-bandwidth pressure values as output, forming the memory-bandwidth pressure extractor.
That is, the pressure extraction layer consists of L2-cache pressure extraction and memory-bandwidth pressure extraction.

L2-cache pressure extraction is responsible for extracting, at runtime, the pressure each application program bears on the shared L2 cache. At runtime, this real-time extraction is performed by a neural network trained offline, whose input is the information collected by the runtime-information extraction layer. In the training stage, runtime information is gathered while a series of L2-cache pressure generators share the GPU with the L2-cache pressure-measurement program; this information serves as input, and the pressure values measured by the pressure-measurement program serve as labels to complete the training of the network. A pressure generator is an application program capable of generating a pressure of a specified magnitude on the corresponding shared resource.
Memory-bandwidth pressure extraction is responsible for extracting, at runtime, the pressure each application program bears on memory bandwidth. At runtime, this real-time extraction is performed by a neural network trained offline, whose input is the information collected by the runtime-information extraction layer. In the training stage, runtime information is gathered while a series of memory-bandwidth pressure generators share the GPU with the memory-bandwidth pressure-measurement program; this information serves as input, and the pressure values measured by the pressure-measurement program serve as labels to complete the training of the network.
In this embodiment, the neural network used by the L2-cache pressure extractor and the memory-bandwidth pressure extractor comprises one input layer, two hidden layers, and one output layer; the number of neurons in each hidden layer equals the number of inputs, and the network uses the LeakyReLU activation function.
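The described network (one input layer, two hidden layers as wide as the input, one scalar output, LeakyReLU activations) can be sketched as a plain-Python forward pass. The weights below are random placeholders; a real extractor would use weights fitted offline against measured pressure values, and the 0.01 negative slope and the absence of bias terms are assumptions:

```python
import random

def leaky_relu(x, slope=0.01):
    """LeakyReLU activation (the 0.01 negative slope is an assumed default)."""
    return x if x > 0 else slope * x

def make_extractor(n_inputs, seed=0):
    """Minimal forward-pass sketch of the extractor network described above:
    one input layer, two hidden layers whose width equals the number of
    inputs, and one scalar output (the predicted pressure value)."""
    rng = random.Random(seed)
    def layer(n_in, n_out):
        # Placeholder weight matrix; offline training would fit these values.
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    weights = [layer(n_inputs, n_inputs),   # hidden layer 1 (width = n_inputs)
               layer(n_inputs, n_inputs),   # hidden layer 2 (width = n_inputs)
               layer(n_inputs, 1)]          # scalar output layer
    def forward(stats):
        h = list(stats)
        for i, w in enumerate(weights):
            z = [sum(wi * hi for wi, hi in zip(row, h)) for row in w]
            # LeakyReLU on the hidden layers; linear output for the pressure value
            h = z if i == len(weights) - 1 else [leaky_relu(v) for v in z]
        return h[0]
    return forward

extractor = make_extractor(n_inputs=4)
pressure = extractor([0.2, 0.5, 0.1, 0.9])  # made-up runtime cache statistics
```

Sizing the hidden layers to the input width keeps the network small enough for the real-time, per-scheduling-interval inference the runtime system requires.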
Step S130: with the collected runtime statistics of the sharing applications and the pressures they bear on the various shared resources as input, predict and output each sharing application's conflict performance degradation with the trained conflict-degradation predictor, and its expansion performance degradation with the trained expansion-degradation predictor.

In this embodiment, the conflict performance degradation is, with the number of SM clusters held fixed, the degradation of an application program's performance under contention on the L2 cache and memory bandwidth relative to its performance without contention; the expansion performance degradation is, in the complete absence of contention on the shared cache and memory bandwidth, the degradation of an application program's performance when using a given number of SM clusters relative to its performance when it monopolizes the whole GPU.
As shown in Fig. 4, in this embodiment, the training process of the conflict-degradation predictor and of the expansion-degradation predictor includes:

Choosing multiple application programs; having the application programs and multiple pressure generators share the GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the L2 cache and on memory bandwidth, each application program's conflict performance degradation, and each application program's expansion performance degradation; training a preset neural network with the collected runtime statistics and measured pressure values as input and the conflict performance degradation as output, forming the conflict-degradation predictor; and training a preset neural network with the collected runtime statistics and measured pressure values as input and the expansion performance degradation as output, forming the expansion-degradation predictor.
That is, in this embodiment, the performance prediction layer consists of conflict performance prediction and expansion performance prediction.

1) Conflict performance prediction is responsible for predicting the conflict performance degradation of an application program: with the number of SMs held fixed, the degradation of its performance under contention on the L2 cache and memory bandwidth relative to its performance without contention. At runtime, the prediction is performed by a neural network trained offline, whose inputs are the information collected by the runtime-information extraction layer and the pressures on the L2 cache and memory bandwidth output by the pressure extraction layer. In the training stage, a broadly representative training set of applications is first collected and assembled. Runtime information is gathered while a series of pressure generators share the GPU with the applications in the training set; this information, together with the corresponding pressure values each training application bears, serves as input, and the training application's actual conflict performance degradation serves as the label to complete the training of the network.
2) The scaling performance predictor is responsible for predicting the scaling performance degradation degree and the scaling performance variation degree of an application program. The scaling performance degradation degree is, in the complete absence of contention on the shared cache and memory bandwidth, the degree to which an application's performance when using a specific number of SMs degrades relative to its performance when occupying the whole GPU exclusively. The scaling performance variation degree is, when using a specific number of SMs without L2 cache and memory bandwidth contention, the change in performance degradation caused by adding or removing one SM. At runtime, the prediction of the scaling performance degradation degree and scaling performance variation degree is performed by an offline-trained neural network. The inputs of this neural network are the information collected by the runtime extraction layer and the pressure on the L2 cache and memory bandwidth output by the pressure extraction layer. In the training stage, a broadly representative training set is first collected and built. For each training application, under different SM allocations and while sharing the GPU with each pressure generator, the collected runtime information together with the pressure values borne by the training application serve as the inputs; the actual scaling performance degradation degree and scaling performance variation degree of the training application serve as the labels, and the neural network is trained.
The neural networks used by the conflict performance predictor and the scaling performance predictor are similar to the neural network used by the pressure extraction layer.
Step S140: according to the predicted conflict performance degradation degree and scaling performance degradation degree of each sharing application, obtain the performance imbalance degree of the GPU, and determine, according to the imbalance degree, a new stream processor allocation plan for reallocating the stream processor clusters.
In this embodiment, obtaining the performance imbalance degree of the GPU specifically includes: obtaining the actual performance degradation degree of each sharing application from the predicted conflict performance degradation degree and scaling performance degradation degree, and obtaining the performance imbalance degree of the GPU from the actual performance degradation degrees; where the actual performance degradation degree equals the product of the corresponding conflict performance degradation degree and scaling performance degradation degree.
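As a minimal sketch of the relationships just stated (function and variable names are illustrative, not from the patent), the actual degradation is the product of the two predicted degrees, and the imbalance degree is the spread across the sharing applications:

```python
# Sketch: actual degradation = conflict degradation x scaling degradation;
# the GPU's imbalance degree is the gap between the most- and least-degraded
# sharing applications.

def actual_degradation(conflict_deg, scaling_deg):
    """Actual performance degradation degree of one sharing application."""
    return conflict_deg * scaling_deg

def imbalance_degree(degradations):
    """Difference between the max and min actual degradation degrees."""
    return max(degradations) - min(degradations)

# Example: two applications sharing the GPU.
apps = [actual_degradation(1.2, 1.5), actual_degradation(1.1, 1.0)]
print(round(imbalance_degree(apps), 6))  # 0.7
```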
Specifically, in this embodiment, determining, according to the imbalance degree, the new allocation plan for reallocating the stream processor clusters specifically includes: if the imbalance degree exceeds a set threshold, reallocation is performed; during reallocation, a preset algorithm each time reassigns one stream processor cluster from the application with the minimum current performance degradation degree to the application with the maximum, gradually reducing the imbalance degree; when the distance between the new allocation plan and the initial allocation plan exceeds a specific threshold, the current new allocation plan is determined to be the final stream processor allocation plan.
The allocation scheduling layer is responsible for predicting the actual performance degradation degree of each application program from the outputs of the conflict performance predictor and the scaling performance predictor. The actual performance degradation degree is, in the presence of contention on the L2 cache and memory bandwidth, the degree to which an application's performance using its currently allocated number of SMs degrades relative to its performance when occupying the whole GPU exclusively. From this definition, the actual performance degradation degree equals the product of the corresponding conflict performance degradation degree and scaling performance degradation degree. On the basis of the predicted actual performance degradation degree, a heuristic greedy algorithm gradually adjusts the number of SMs allocated to each application, reducing the imbalance of the performance degradation degree among the applications. The imbalance degree of performance degradation is defined as the difference between the maximum and minimum performance degradation degree among all sharing applications. The specific scheduling algorithm is shown in Table 1 below.
Table 1
The algorithm is invoked periodically. It first checks whether the imbalance degree under the current allocation is already below the specified threshold, and if so, the algorithm terminates immediately. If the current imbalance degree exceeds the specified threshold, the greedy method repeatedly reassigns one SM from the application with the minimum current performance degradation degree to the application with the maximum, gradually reducing the imbalance degree. The distance between two allocations is defined as the maximum change in SM count over all applications. When the distance between the new allocation plan and the initial allocation plan exceeds a specific threshold, the algorithm terminates immediately. This accounts for the fact that the accuracy of the predicted values depends on how far the allocation plan has deviated from the original plan.
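The greedy loop just described can be sketched in Python as follows. This is a hedged illustration, not the patent's exact Table 1 algorithm: the function names, the predictor hook, and the threshold handling are assumptions.

```python
def rebalance(alloc, degradation, imbalance_threshold, distance_threshold):
    """Greedy SM reallocation sketch.

    alloc:       {app: number of SMs currently allocated}
    degradation: callable(app, sms) -> predicted actual degradation degree
                 under the candidate allocation (hypothetical predictor hook)
    """
    initial = dict(alloc)
    while True:
        degs = {app: degradation(app, sms) for app, sms in alloc.items()}
        worst = max(degs, key=degs.get)   # most degraded: give it one SM
        best = min(degs, key=degs.get)    # least degraded: take one SM away
        if degs[worst] - degs[best] <= imbalance_threshold:
            break  # imbalance already below the specified threshold
        alloc[best] -= 1
        alloc[worst] += 1
        # Distance between two allocations: the maximum per-application
        # change in SM count. Stop before drifting too far from the initial
        # plan, where the predictions are no longer trusted to be accurate.
        distance = max(abs(alloc[a] - initial[a]) for a in alloc)
        if distance >= distance_threshold:
            break
    return alloc
```

With a toy degradation model where degradation falls as an application gains SMs (e.g. `base[app] / sms`), one SM migrates from the less-degraded to the more-degraded application until the gap closes.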
In summary, to use the present invention, offline training is performed first. As shown in Fig. 4, the training flow is:
1) Design pressure measurement programs for the architecture of the target GPU: separately design pressure measurement programs for the L2 shared cache and for global memory bandwidth, to quantify the severity of contention on each shared resource.
2) Design pressure generators for the architecture of the target GPU: separately design pressure generators for the L2 cache and for global memory bandwidth, used to generate pressure of a specific magnitude on a specific shared resource.
3) Run the pressure measurement programs together with the pressure generators: let the various pressure measurement programs and pressure generators share the GPU, and collect the corresponding runtime information and the generated pressure values.
4) Train the runtime pressure extractor: using the runtime information collected in the previous stage as input and the measured pressure values as output, train a neural network for online pressure extraction.
5) Collect a sufficiently representative set of application programs: collect a group of application programs that fully covers the various mainstream situations; according to the target application scenarios, collect representative application programs to form the training set.
6) Run the application programs together with the pressure generators: let the application programs share the GPU with the pressure generators, and collect the runtime statistics, the conflict performance degradation and scaling performance degradation of the application programs, and the pressure generated by the pressure generators.
7) Train the conflict performance degradation predictor: using the runtime information collected in the previous stage and the pressure generated by the pressure generators as inputs, and the conflict performance degradation of the application program as the label, train a neural network for online conflict performance degradation prediction.
8) Train the scaling performance degradation predictor: using the runtime information collected in the previous stage and the pressure generated by the pressure generators as inputs, and the scaling performance degradation of the application program as the label, train a neural network for online scaling performance degradation prediction.
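Step 4 above can be sketched as a small regression network. The shape follows the description elsewhere in this document (one input layer, two hidden layers each as wide as the input, one output layer, LeakyReLU activations); the training data here is synthetic stand-in data, not real GPU counters, and the training loop is a plain full-batch gradient descent sketch rather than the patent's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    return np.where(x > 0, 1.0, a)

n_in = 4                          # e.g. 4 runtime statistics per sample
X = rng.random((256, n_in))       # synthetic runtime statistics
y = X.sum(axis=1, keepdims=True)  # synthetic "measured pressure" target

# Weights: input -> hidden1 -> hidden2 -> output (hidden width = n_in).
W1 = rng.standard_normal((n_in, n_in)) * 0.5
W2 = rng.standard_normal((n_in, n_in)) * 0.5
W3 = rng.standard_normal((n_in, 1)) * 0.5

def forward(X):
    z1 = X @ W1; h1 = leaky_relu(z1)
    z2 = h1 @ W2; h2 = leaky_relu(z2)
    return z1, h1, z2, h2, h2 @ W3

initial_loss = float(np.mean((forward(X)[-1] - y) ** 2))

lr = 0.01
for step in range(500):
    z1, h1, z2, h2, pred = forward(X)
    err = pred - y                       # gradient of the squared error
    gW3 = h2.T @ err / len(X)
    dh2 = (err @ W3.T) * leaky_relu_grad(z2)
    gW2 = h1.T @ dh2 / len(X)
    dh1 = (dh2 @ W2.T) * leaky_relu_grad(z1)
    gW1 = X.T @ dh1 / len(X)
    W1 -= lr * gW1; W2 -= lr * gW2; W3 -= lr * gW3

loss = float(np.mean((forward(X)[-1] - y) ** 2))
```

The same skeleton, retargeted at different labels, would serve steps 7 and 8 (degradation degrees instead of pressure values as outputs).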
After offline training is completed, online scheduling can be carried out. The online scheduling flow, shown in Fig. 5, is:
1) Collect runtime information: collect the cache statistics at each level during the run of each sharing application, together with the current SM allocation plan.
2) Extract pressure: using the information collected in the previous stage as input, extract the pressure each sharing application bears on the L2 cache and on memory bandwidth.
3) Predict conflict performance degradation: using the collected runtime information of a sharing application and the pressure it bears on the various shared resources as inputs, output the conflict performance degradation degree of the application through the conflict performance degradation predictor.
4) Predict scaling performance degradation: using the collected runtime information of a sharing application and the pressure it bears on the various shared resources as inputs, output the scaling performance degradation degree of the application through the scaling performance degradation predictor.
5) Reallocate SMs: from the predicted conflict performance degradation and scaling performance degradation, compute the actual performance degradation degree of each sharing application, and from it further compute the performance imbalance degree of the system. If the imbalance degree exceeds a specific threshold, reallocation is performed: the greedy method gradually reduces the imbalance degree of the current system, yielding an allocation plan better than the current one, which is then applied.
An embodiment of the present invention also provides a storage medium, comprising a GPU processor and a memory, the memory storing program instructions; when the GPU processor runs the program instructions, the performance-balanced scheduling method for a GPU described above is realized. The performance-balanced scheduling method for a GPU has been described in detail above and is not repeated here.
An embodiment of the present invention also provides an electronic terminal, for example a server, comprising a GPU processor and a memory, the memory storing program instructions; when the GPU processor runs the program instructions, the performance-balanced scheduling method for a GPU described above is realized. The performance-balanced scheduling method for a GPU has been described in detail above and is not repeated here.
In conclusion the present invention provides a set of balancing performance scheduling mechanisms that multitask GPU is shared towards preemptive type.It should
Mechanism can be under the premise of hardware supported not be increased, it is ensured that promotes GPU resource utilization rate using spatial parallelism and system is whole
On the basis of performance, it is further ensured that the equilibrium of hydraulic performance decline degree between sharing application, achievement of the invention can be used directly
In the publicly-owned cloud environment of multi-tenant, to ensure the fairness of performance between each shared user.So the present invention effectively overcomes
Various shortcoming of the prior art and have high industrial utilization.
The principle of the present invention and effect is only illustrated in above-described embodiment, and is not intended to limit the present invention.It is any to be familiar with
The personage of this technology all can carry out modifications and changes under the spirit and scope without prejudice to the present invention to above-described embodiment.Therefore,
Such as those of ordinary skill in the art without departing from disclosed spirit with being completed under technological thought
All equivalent modifications or change, should by the present invention claim be covered.
Claims (10)
1. A performance-balanced scheduling method for a GPU, characterized in that the performance-balanced scheduling method comprises:
collecting cache statistics at each level during the run of each sharing application and the current stream processor cluster allocation plan;
extracting, through a trained runtime pressure extractor, the pressure each sharing application bears on the L2 cache and on memory bandwidth;
using the collected runtime statistics of a sharing application and the pressure the sharing application bears on the various shared resources as inputs, predicting and outputting the conflict performance degradation degree of the sharing application through a trained conflict performance degradation predictor, and predicting and outputting the scaling performance degradation degree of the sharing application through a trained scaling performance degradation predictor;
according to the predicted conflict performance degradation degree and scaling performance degradation degree of the sharing application, obtaining the performance imbalance degree of the GPU, and determining, according to the imbalance degree, a new stream processor allocation plan for reallocating the stream processor clusters.
2. The performance-balanced scheduling method for a GPU according to claim 1, characterized in that the training process of the runtime pressure extractor comprises:
separately designing multiple pressure measurement programs for the L2 cache and for memory bandwidth;
separately designing multiple pressure generators for the L2 cache and for memory bandwidth;
letting the multiple pressure measurement programs and the multiple pressure generators share the GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the L2 cache and memory bandwidth;
using the collected runtime statistics as input and the measured pressure values as output, training a preset neural network to form the runtime pressure extractor.
3. The performance-balanced scheduling method for a GPU according to claim 2, characterized in that the training processes of the conflict performance degradation predictor and the scaling performance degradation predictor comprise:
selecting multiple application programs;
letting the multiple application programs and the multiple pressure generators share the GPU, collecting the corresponding runtime statistics, and measuring the pressure values generated on the L2 cache and memory bandwidth, the conflict performance degradation degree of each application program, and the scaling performance degradation degree of each application program;
using the collected runtime statistics and the measured pressure values as inputs and the conflict performance degradation degree of the application program as the output, training a preset neural network to form the conflict performance degradation predictor;
using the collected runtime statistics and the measured pressure values as inputs and the scaling performance degradation degree of the application program as the output, training a preset neural network to form the scaling performance degradation predictor.
4. The performance-balanced scheduling method for a GPU according to claim 2, characterized in that using the collected runtime statistics as input and the measured pressure values as output to train a preset neural network and form the runtime pressure extractor specifically comprises:
using the collected runtime statistics as input and the measured L2 cache pressure values as output, training a preset neural network to form an L2 cache pressure extractor;
using the collected runtime statistics as input and the measured memory bandwidth pressure values as output, training a preset neural network to form a memory bandwidth pressure extractor.
5. The performance-balanced scheduling method for a GPU according to claim 4, characterized in that the neural networks used by the L2 cache pressure extractor and the memory bandwidth pressure extractor comprise one input layer, two hidden layers, and one output layer; wherein the number of neurons in each hidden layer equals the number of inputs, and the activation function used by the neural networks is the LeakyReLU function.
6. The performance-balanced scheduling method for a GPU according to claim 1 or 3, characterized in that the conflict performance degradation degree is, for a fixed number of stream processor clusters, the degree to which an application program's performance degrades when there is contention on the L2 cache and memory bandwidth relative to its performance without contention; and the scaling performance degradation degree is, in the complete absence of contention on the shared cache and memory bandwidth, the degree to which an application program's performance when using a given number of stream processor clusters degrades relative to its performance when occupying the whole GPU exclusively.
7. The performance-balanced scheduling method for a GPU according to claim 1, characterized in that obtaining the performance imbalance degree of the GPU specifically comprises:
obtaining the actual performance degradation degree of each sharing application from the predicted conflict performance degradation degree and scaling performance degradation degree, and obtaining the performance imbalance degree of the GPU from the actual performance degradation degrees; wherein the actual performance degradation degree equals the product of the corresponding conflict performance degradation degree and scaling performance degradation degree.
8. The performance-balanced scheduling method for a GPU according to claim 1 or 7, characterized in that determining, according to the imbalance degree, the new stream processor allocation plan for reallocating the stream processor clusters specifically comprises:
if the imbalance degree exceeds a set threshold, performing reallocation; during reallocation, a preset algorithm each time reassigns one stream processor cluster from the application with the minimum current performance degradation degree to the application with the maximum, gradually reducing the imbalance degree; when the distance between the new allocation plan and the initial allocation plan exceeds a specific threshold, determining the current new allocation plan to be the final stream processor allocation plan.
9. A storage medium, comprising a GPU processor and a memory, the memory storing program instructions, wherein the GPU processor runs the program instructions to realize the method according to any one of claims 1 to 8.
10. An electronic terminal, comprising a GPU processor and a memory, the memory storing program instructions, wherein the GPU processor runs the program instructions to realize the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711460215.1A CN108228351B (en) | 2017-12-28 | 2017-12-28 | GPU performance balance scheduling method, storage medium and electronic terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108228351A true CN108228351A (en) | 2018-06-29 |
CN108228351B CN108228351B (en) | 2021-07-27 |
Family
ID=62646577
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711460215.1A Active CN108228351B (en) | 2017-12-28 | 2017-12-28 | GPU performance balance scheduling method, storage medium and electronic terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108228351B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020056620A1 (en) * | 2018-09-19 | 2020-03-26 | Intel Corporation | Hybrid virtual gpu co-scheduling |
CN110929627A (en) * | 2019-11-18 | 2020-03-27 | 北京大学 | Image recognition method of efficient GPU training model based on wide-model sparse data set |
CN117762654A (en) * | 2023-12-22 | 2024-03-26 | 摩尔线程智能科技(北京)有限责任公司 | Method, device, equipment and storage medium for collecting GPU information by application program |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104461928A (en) * | 2013-09-16 | 2015-03-25 | 华为技术有限公司 | Method and device for dividing caches |
CN105487927A (en) * | 2014-09-15 | 2016-04-13 | 华为技术有限公司 | Resource management method and device |
CN106383792A (en) * | 2016-09-20 | 2017-02-08 | 北京工业大学 | Missing perception-based heterogeneous multi-core cache replacement method |
US20170352120A1 (en) * | 2007-07-13 | 2017-12-07 | Cerner Innovation, Inc. | Claim processing validation system |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170352120A1 (en) * | 2007-07-13 | 2017-12-07 | Cerner Innovation, Inc. | Claim processing validation system |
CN104461928A (en) * | 2013-09-16 | 2015-03-25 | 华为技术有限公司 | Method and device for dividing caches |
CN105487927A (en) * | 2014-09-15 | 2016-04-13 | 华为技术有限公司 | Resource management method and device |
CN106383792A (en) * | 2016-09-20 | 2017-02-08 | 北京工业大学 | Missing perception-based heterogeneous multi-core cache replacement method |
Non-Patent Citations (2)
Title |
---|
RODRIGO ESCOBAR: "Performance prediction of parallel applications based on small-scale executions", 《IEEE》 *
SIQI WANG: "CGPredict: Embedded GPU performance estimation from single-threaded applications", 《ACM》 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020056620A1 (en) * | 2018-09-19 | 2020-03-26 | Intel Corporation | Hybrid virtual gpu co-scheduling |
US11900157B2 (en) | 2018-09-19 | 2024-02-13 | Intel Corporation | Hybrid virtual GPU co-scheduling |
CN110929627A (en) * | 2019-11-18 | 2020-03-27 | 北京大学 | Image recognition method of efficient GPU training model based on wide-model sparse data set |
CN110929627B (en) * | 2019-11-18 | 2021-12-28 | 北京大学 | Image recognition method of efficient GPU training model based on wide-model sparse data set |
CN117762654A (en) * | 2023-12-22 | 2024-03-26 | 摩尔线程智能科技(北京)有限责任公司 | Method, device, equipment and storage medium for collecting GPU information by application program |
Also Published As
Publication number | Publication date |
---|---|
CN108228351B (en) | 2021-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hung et al. | Wide-area analytics with multiple resources | |
CN106776005A (en) | A kind of resource management system and method towards containerization application | |
Peddi et al. | An intelligent cloud-based data processing broker for mobile e-health multimedia applications | |
KR20210082210A (en) | Creating an Integrated Circuit Floor Plan Using Neural Networks | |
CN106445681B (en) | Distributed task dispatching system and method | |
CN111274036B (en) | Scheduling method of deep learning task based on speed prediction | |
CN108228351A (en) | Balancing performance dispatching method, storage medium and the electric terminal of GPU | |
CN103955398B (en) | Virtual machine coexisting scheduling method based on processor performance monitoring | |
CN108664378A (en) | A kind of most short optimization method for executing the time of micro services | |
CN104243617B (en) | Towards the method for scheduling task and system of mixed load in a kind of isomeric group | |
Basireddy et al. | AdaMD: Adaptive mapping and DVFS for energy-efficient heterogeneous multicores | |
CN107861606A (en) | A kind of heterogeneous polynuclear power cap method by coordinating DVFS and duty mapping | |
CN110009233B (en) | Game theory-based task allocation method in crowd sensing | |
CN108845874A (en) | The dynamic allocation method and server of resource | |
CN106354729A (en) | Graph data handling method, device and system | |
CN107657599A (en) | Remote sensing image fusion system in parallel implementation method based on combination grain division and dynamic load balance | |
CN107315889A (en) | The performance test methods and storage medium of simulation engine | |
CN109697637A (en) | Object type determines method, apparatus, electronic equipment and computer storage medium | |
CN111860867B (en) | Model training method and system for hybrid heterogeneous system and related device | |
CN110347602A (en) | Multitask script execution and device, electronic equipment and readable storage medium storing program for executing | |
CN108900343A (en) | Local storage-based resource prediction and scheduling method for cloud server | |
CN106888156A (en) | A kind of method and device for playing reward distribution | |
CN104123119B (en) | Dynamic vision measurement feature point center quick positioning method based on GPU | |
Rak | Performance modeling using queueing Petri nets | |
CN113190342B (en) | Method and system architecture for multi-application fine-grained offloading of cloud-edge collaborative networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||