CN108228351A - GPU performance balance scheduling method, storage medium and electronic terminal - Google Patents

GPU performance balance scheduling method, storage medium and electronic terminal

Info

Publication number
CN108228351A
Authority
CN
China
Prior art keywords
degree
gpu
performance
pressure
degradation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711460215.1A
Other languages
Chinese (zh)
Other versions
CN108228351B (en)
Inventor
过敏意
赵文益
陈全
徐莉婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201711460215.1A priority Critical patent/CN108228351B/en
Publication of CN108228351A publication Critical patent/CN108228351A/en
Application granted granted Critical
Publication of CN108228351B publication Critical patent/CN108228351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention provides a GPU performance balance scheduling method, a storage medium and an electronic terminal. The method includes: collecting the per-level cache runtime statistics of each co-running application and the current streaming multiprocessor (SM) cluster allocation plan; extracting, with a trained runtime pressure extractor, the pressure each co-running application experiences on the shared L2 cache and on memory bandwidth; taking the collected runtime statistics and the extracted pressures as input, outputting each application's conflict performance degradation through a trained conflict performance degradation predictor and its scaling performance degradation through a trained scaling performance degradation predictor; and, from the conflict and scaling performance degradation of each application, obtaining the performance imbalance of the GPU and determining, according to the imbalance, a new SM allocation plan that redistributes the SM clusters. The present invention ensures that performance degradation is balanced across co-running applications.

Description

GPU performance balance scheduling method, storage medium and electronic terminal
Technical field
The present invention relates to the field of processor technology, and in particular to the field of GPU technology; specifically, it provides a GPU performance balance scheduling method, a storage medium and an electronic terminal.
Background technology
With the large-scale deployment of compute-intensive applications such as speech recognition, machine translation and personal assistants, mainstream private data centers and public cloud platforms have begun to adopt coprocessors such as GPUs on a large scale to cope with the insufficient computing power of traditional CPUs. The GPU was originally a special-purpose processor designed for graphics computation; because it offers parallelism that traditional CPUs cannot match, more and more non-graphics applications are migrating to the GPU platform to meet their rapidly growing computational demands. Research shows, however, that non-graphics applications often lack sufficient parallelism to fully utilize the hardware resources of the GPU, leading to waste. Meanwhile, as GPU architecture and fabrication technology develop, more and more streaming multiprocessors (Streaming Multiprocessor, SM) are integrated into a single GPU, making the resource-waste problem even more prominent.
To address this, spatial multitasking, in which multiple applications run simultaneously on one GPU, has been proposed. Related research shows that spatial multitasking can greatly improve GPU resource utilization and overall system performance. When multiple applications share a GPU, they compete with one another for 1) streaming multiprocessors, 2) the shared L2 cache and 3) global memory bandwidth. Consequently, each application's performance under sharing declines relative to its performance when it monopolizes the whole GPU. Moreover, because different applications differ in their sensitivity to contention on the shared L2 cache and global memory bandwidth, the traditional scheme of splitting the SMs evenly among applications cannot guarantee performance fairness among co-running applications; that is, each application suffers a different degree of performance degradation.
For a multi-tenant cloud platform, guaranteeing performance fairness among co-running applications is of great significance. If performance fairness cannot be guaranteed, then, according to game theory, platform users will tend to resist sharing a GPU with other users, greatly reducing the opportunity to raise GPU resource utilization and overall system performance through spatial multitasking and putting the platform at a competitive disadvantage. Therefore, on the basis of using spatial multitasking to raise resource utilization and overall system performance, it is of great significance to guarantee performance fairness among co-running applications.
Summary of the invention
In view of the above shortcomings of the prior art, the purpose of the present invention is to provide a GPU performance balance scheduling method, a storage medium and an electronic terminal that, through accurate prediction of performance degradation and dynamic SM allocation scheduling, guarantee performance fairness among co-running applications while using spatial multitasking to raise resource utilization and overall system performance.
To achieve the above and other related purposes, the present invention provides a GPU performance balance scheduling method comprising: collecting the per-level cache runtime statistics of each co-running application and the current SM cluster allocation plan; extracting, with a trained runtime pressure extractor, the pressure each co-running application experiences on the shared L2 cache and on memory bandwidth; taking the collected runtime statistics of each co-running application and the pressures it experiences on the various shared resources as input, outputting the application's conflict performance degradation through a trained conflict performance degradation predictor and its scaling performance degradation through a trained scaling performance degradation predictor; and, from the predicted conflict performance degradation and scaling performance degradation of each application, obtaining the performance imbalance of the GPU and determining, according to the imbalance, a new SM allocation plan that redistributes the SM clusters.
In one embodiment of the invention, training the runtime pressure extractor includes: designing separate pressure-probe programs for the shared L2 cache and for memory bandwidth; designing separate pressure generators for the shared L2 cache and for memory bandwidth; letting the pressure probes and the pressure generators share the GPU, collecting the corresponding runtime statistics and measuring the pressure generated on the L2 cache and on memory bandwidth; and training a preset neural network with the collected runtime statistics as input and the measured pressure values as output, forming the runtime pressure extractor.
In one embodiment of the invention, training the conflict performance degradation predictor and training the scaling performance degradation predictor include: selecting multiple application programs; letting the applications and the pressure generators share the GPU, collecting the corresponding runtime statistics and measuring the pressure generated on the L2 cache and on memory bandwidth, each application's conflict performance degradation and each application's scaling performance degradation; training a preset neural network with the collected runtime statistics and measured pressure values as input and the applications' conflict performance degradation as output, forming the conflict performance degradation predictor; and training a preset neural network with the collected runtime statistics and measured pressure values as input and the applications' scaling performance degradation as output, forming the scaling performance degradation predictor.
In one embodiment of the invention, training the preset neural network with the collected runtime statistics as input and the measured pressure values as output, forming the runtime pressure extractor, specifically includes: training a preset neural network with the collected runtime statistics as input and the measured L2-cache pressure values as output, forming an L2-cache pressure extractor; and training a preset neural network with the collected runtime statistics as input and the measured memory-bandwidth pressure values as output, forming a memory-bandwidth pressure extractor.
In one embodiment of the invention, the neural network used by the L2-cache pressure extractor and the memory-bandwidth pressure extractor comprises one input layer, two hidden layers and one output layer; the number of neurons in each hidden layer equals the number of inputs, and the network uses the LeakyReLU activation function.
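As an illustrative sketch only (the invention specifies the topology but not an implementation), such an extractor network could be written in PyTorch as follows; the input width of 16 runtime counters, the optimizer and the learning rate are assumptions:

```python
import torch
import torch.nn as nn

class PressureExtractor(nn.Module):
    """MLP with the topology described above: one input layer, two hidden
    layers whose width equals the number of inputs, one output, LeakyReLU."""

    def __init__(self, n_inputs: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, n_inputs), nn.LeakyReLU(),
            nn.Linear(n_inputs, n_inputs), nn.LeakyReLU(),
            nn.Linear(n_inputs, 1),  # predicted pressure value
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# n_inputs = 16 is a placeholder for however many runtime counters are collected.
model = PressureExtractor(n_inputs=16)
loss_fn = nn.MSELoss()           # regression against measured pressure labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```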
In one embodiment of the invention, the conflict performance degradation is the degradation, at a fixed number of SM clusters, of an application's performance under L2-cache and memory-bandwidth contention relative to its performance without contention; the scaling performance degradation is the degradation, in the complete absence of shared-cache and memory-bandwidth contention, of an application's performance when using a given number of SM clusters relative to its performance when monopolizing the whole GPU.
In one embodiment of the invention, obtaining the performance imbalance of the GPU specifically includes: obtaining each co-running application's actual performance degradation from its predicted conflict performance degradation and scaling performance degradation, and obtaining the performance imbalance of the GPU from the actual performance degradations; the actual performance degradation equals the product of the corresponding conflict performance degradation and scaling performance degradation.
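A minimal sketch of these two quantities (function and variable names are illustrative; the max-minus-min definition of imbalance is the one given in the detailed description below):

```python
def actual_degradation(conflict_deg: float, scaling_deg: float) -> float:
    # Actual performance degradation = conflict degradation x scaling degradation.
    return conflict_deg * scaling_deg

def performance_imbalance(degradations: list[float]) -> float:
    # Imbalance = gap between the most- and least-degraded applications.
    return max(degradations) - min(degradations)
```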
In one embodiment of the invention, determining the new SM allocation plan according to the imbalance specifically includes: if the imbalance exceeds a set threshold, reallocating; during reallocation, repeatedly moving one SM cluster, via a preset algorithm, from the application with the smallest current performance degradation to the application with the largest, gradually reducing the imbalance; and, when the distance between the new allocation plan and the initial allocation plan exceeds a specific threshold, determining the current new allocation plan as the new SM allocation plan.
An embodiment of the present invention further provides a storage medium storing program instructions; when a GPU processor runs the program instructions, the method described above is implemented.
An embodiment of the present invention further provides an electronic terminal comprising a GPU processor and a memory; the memory stores program instructions, and the GPU processor runs the program instructions to implement the method described above.
As described above, the GPU performance balance scheduling method, storage medium and electronic terminal of the present invention have the following beneficial effects:
The present invention provides a performance balance scheduling mechanism for preemption-enabled spatially multitasked GPUs. Without additional hardware support, the mechanism ensures, on top of the resource-utilization and overall-performance gains of spatial multitasking, that performance degradation is balanced across co-running applications. The results of the invention can be applied directly in multi-tenant public cloud environments to guarantee performance fairness among sharing users.
Description of the drawings
Fig. 1 is a schematic flow diagram of the GPU performance balance scheduling method of the present invention.
Fig. 2 is an architecture diagram for the GPU performance balance scheduling method of the present invention.
Fig. 3 is a system architecture diagram of an application of the GPU performance balance scheduling method of the present invention.
Fig. 4 is a schematic flow diagram of the offline training phase of the GPU performance balance scheduling method of the present invention.
Fig. 5 is a schematic flow diagram of the online scheduling phase of the GPU performance balance scheduling method of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described below through specific examples; those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other, different specific embodiments, and the details in this specification can be modified or changed from different viewpoints and for different applications without departing from the spirit of the present invention.
The purpose of this embodiment is to provide a GPU performance balance scheduling method, a storage medium and an electronic terminal that, through accurate prediction of performance degradation and dynamic SM allocation scheduling, guarantee performance fairness among co-running applications while using spatial multitasking to raise resource utilization and overall system performance.
The principles and embodiments of the GPU performance balance scheduling method, storage medium and electronic terminal of the present invention are described in detail below, so that those skilled in the art can understand them without creative work.
Specifically, this embodiment aims at a low-overhead performance balance scheduling mechanism for preemption-enabled spatially multitasked GPUs that guarantees performance fairness among co-running applications on the basis of raising resource utilization and overall system performance.
The GPU performance balance scheduling method, storage medium and electronic terminal of this embodiment are described in detail below.
As shown in Fig. 1, this embodiment provides a GPU performance balance scheduling method comprising the following steps:
Step S110: collect the per-level cache runtime statistics of each co-running application and the current streaming multiprocessor (SM) cluster allocation plan;
Step S120: extract, with the trained runtime pressure extractor, the pressure each co-running application experiences on the shared L2 cache and on memory bandwidth;
Step S130: taking the collected runtime statistics of each co-running application and the pressures it experiences on the various shared resources as input, output the application's conflict performance degradation through the trained conflict performance degradation predictor, and output its scaling performance degradation through the trained scaling performance degradation predictor;
Step S140: from the predicted conflict performance degradation and scaling performance degradation of each co-running application, compute the performance imbalance of the GPU and determine, according to the imbalance, a new SM allocation plan that redistributes the SM clusters.
The GPU performance balance scheduling method of this embodiment is described in detail below.
The underlying hardware to which the method of this embodiment applies is a multitasking GPU that supports SM-level preemption. As shown in Fig. 2, a GPU usually consists of several SMs (streaming multiprocessors, also called GPU "big cores"); each SM has its own private L1 cache, and all SMs share one L2 cache. When two applications App-a and App-b run on a preemption-enabled multitasking GPU, they compete with each other for SMs, the shared L2 cache and global memory bandwidth.
Fig. 3 shows the software architecture of the entire performance balance scheduling mechanism for preemption-enabled spatially multitasked GPUs. The software architecture of the runtime system is divided into four layers: a runtime-information extraction layer, a pressure extraction layer, a performance prediction layer and an allocation-scheduling layer.
Step S110: collect the per-level cache runtime statistics of each co-running application and the current SM cluster allocation plan.
The runtime-information extraction layer collects the per-level cache runtime statistics of each co-running application and the current SM cluster allocation plan. It extracts the per-level cache information provided by the on-chip statistics module of the GPU together with the current SM allocation. This information reflects the characteristics of each co-running application, and all subsequent processing depends entirely on it.
Step S120: extract, with the trained runtime pressure extractor, the pressure each co-running application experiences on the shared L2 cache and on memory bandwidth.
Pressure reflects the severity of contention on a shared resource.
In this embodiment, as shown in Fig. 4, training the runtime pressure extractor includes: designing separate pressure-probe programs for the shared L2 cache and for memory bandwidth; designing separate pressure generators for the shared L2 cache and for memory bandwidth; letting the pressure probes and the pressure generators share the GPU, collecting the corresponding runtime statistics and measuring the pressure generated on the L2 cache and on memory bandwidth; and training a preset neural network with the collected runtime statistics as input and the measured pressure values as output, forming the runtime pressure extractor.
To quantify pressure, each shared resource (the shared L2 cache or memory bandwidth) has its own dedicated pressure-probe program. When an application shares the GPU with a pressure probe, the pressure the application generates on the corresponding shared resource is defined as the performance degradation of that probe relative to its performance when monopolizing the whole GPU. The pressure an application experiences on a shared resource is defined as the pressure generated on that resource by the other applications.
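A minimal sketch of these definitions, assuming probe performance is measured externally; expressing pressure as fractional slowdown and aggregating by summation are conventions assumed here, not fixed by the text:

```python
def pressure_generated(probe_solo_perf: float, probe_shared_perf: float) -> float:
    """Pressure an application generates on a resource = performance
    degradation of that resource's dedicated probe when co-running with
    the application, relative to the probe monopolizing the whole GPU.
    Expressed here as fractional slowdown, one possible convention."""
    return 1.0 - probe_shared_perf / probe_solo_perf

def pressure_experienced(pressures_of_others: list[float]) -> float:
    """Pressure an application experiences on a resource = pressure the
    other co-running applications generate on it; summing is an assumption."""
    return sum(pressures_of_others)
```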
In this embodiment, training the preset neural network with the collected runtime statistics as input and the measured pressure values as output, forming the runtime pressure extractor, specifically includes: training one network with the collected runtime statistics as input and the measured L2-cache pressure values as output, forming the L2-cache pressure extractor; and training another network with the collected runtime statistics as input and the measured memory-bandwidth pressure values as output, forming the memory-bandwidth pressure extractor.
That is, the pressure extraction layer comprises shared-L2-cache pressure extraction and memory-bandwidth pressure extraction.
Shared-L2-cache pressure extraction is responsible for extracting, at runtime, the pressure each application experiences on the shared L2 cache. At runtime, an offline-trained neural network performs real-time extraction of the pressure on the L2 cache; the input of this network is the information collected by the runtime-information extraction layer. In the training phase, a series of L2-cache pressure generators and the shared-L2-cache pressure probe share the GPU; the runtime information collected during these runs serves as input, and the pressure values measured by the probe serve as labels, completing the training of the network. A pressure generator is an application capable of generating pressure of a particular magnitude on the corresponding shared resource.
Memory-bandwidth pressure extraction is responsible for extracting, at runtime, the pressure each application experiences on memory bandwidth. At runtime, an offline-trained neural network performs real-time extraction of the pressure on memory bandwidth; the input of this network is the information collected by the runtime-information extraction layer. In the training phase, a series of memory-bandwidth pressure generators and the memory-bandwidth pressure probe share the GPU; the runtime information collected during these runs serves as input, and the pressure values measured by the probe serve as labels, completing the training of the network.
In this embodiment, the neural network used by the L2-cache pressure extractor and the memory-bandwidth pressure extractor comprises one input layer, two hidden layers and one output layer; the number of neurons in each hidden layer equals the number of inputs, and the network uses the LeakyReLU activation function.
Step S130: taking the collected runtime statistics of each co-running application and the pressures it experiences on the various shared resources as input, output the application's conflict performance degradation through the trained conflict performance degradation predictor, and output its scaling performance degradation through the trained scaling performance degradation predictor.
In this embodiment, the conflict performance degradation is the degradation, at a fixed number of SM clusters, of an application's performance under L2-cache and memory-bandwidth contention relative to its performance without contention; the scaling performance degradation is the degradation, in the complete absence of shared-cache and memory-bandwidth contention, of an application's performance when using a given number of SM clusters relative to its performance when monopolizing the whole GPU. For example, under the convention that a degradation degree is the fraction of reference performance retained, an application retaining 80% of its contention-free performance has a conflict performance degradation of 0.8, and one retaining 50% of its whole-GPU performance without contention has a scaling performance degradation of 0.5, giving an actual performance degradation of 0.4 under the product rule used in step S140.
As shown in Fig. 4, in this embodiment, training the conflict performance degradation predictor and the scaling performance degradation predictor includes:
selecting multiple application programs; letting the applications and the pressure generators share the GPU, collecting the corresponding runtime statistics and measuring the pressure generated on the L2 cache and on memory bandwidth, each application's conflict performance degradation and each application's scaling performance degradation; training a preset neural network with the collected runtime statistics and measured pressure values as input and the conflict performance degradation as output, forming the conflict performance degradation predictor; and training a preset neural network with the collected runtime statistics and measured pressure values as input and the scaling performance degradation as output, forming the scaling performance degradation predictor.
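An illustrative training sketch for either predictor, reusing the same MLP topology as the pressure extractors; the feature layout (runtime counters concatenated with measured pressures), the epoch count and the optimizer are assumptions:

```python
import torch
import torch.nn as nn

def train_degradation_predictor(features: torch.Tensor,
                                labels: torch.Tensor,
                                epochs: int = 200) -> nn.Module:
    """features: (N, D) rows of runtime statistics concatenated with the
    measured pressures; labels: (N, 1) measured degradation degrees.
    The same routine serves both the conflict and the scaling predictor,
    differing only in which label column is supplied."""
    d = features.shape[1]
    model = nn.Sequential(
        nn.Linear(d, d), nn.LeakyReLU(),
        nn.Linear(d, d), nn.LeakyReLU(),
        nn.Linear(d, 1),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
    return model
```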
That is, in this embodiment, the performance prediction layer comprises conflict performance prediction and scaling performance prediction.
1) Conflict performance prediction is responsible for predicting an application's conflict performance degradation: at a fixed number of SMs, the degradation of its performance under L2-cache and memory-bandwidth contention relative to its performance without contention. At runtime, an offline-trained neural network performs the prediction; its input is the information collected by the runtime-information extraction layer together with the L2-cache and memory-bandwidth pressures output by the pressure extraction layer. In the training phase, a broadly representative training set is first collected and built. A series of pressure generators share the GPU with the applications in the training set; the runtime information collected during these runs and the corresponding pressures experienced by the training application serve as input, and the application's measured conflict performance degradation serves as the label, completing the training of the network.
2) Scaling performance prediction is responsible for predicting an application's scaling performance degradation and scaling performance variation. The scaling performance degradation is, in the complete absence of shared-cache and memory-bandwidth contention, the degradation of an application's performance when using a specific number of SMs relative to its performance when monopolizing the whole GPU. The scaling performance variation is, in the absence of L2-cache and memory-bandwidth contention and at a specific SM count, the change in degradation brought by adding or removing one SM. At runtime, an offline-trained neural network performs the prediction of the scaling performance degradation and scaling performance variation; its input is the information collected by the runtime-information extraction layer together with the L2-cache and memory-bandwidth pressures output by the pressure extraction layer. In the training phase, a broadly representative training set is first collected and built. Each training application shares the GPU with each pressure generator under different SM allocations; the runtime information collected during these runs and the corresponding pressures experienced by the training application serve as input, and the application's measured scaling performance degradation and scaling performance variation serve as labels, completing the training of the network.
The neural networks used for conflict performance prediction and scaling performance prediction are similar to those used in the pressure extraction layer.
Step S140: from the predicted conflict performance degradation and scaling performance degradation of each co-running application, compute the performance imbalance of the GPU and determine, according to the imbalance, a new SM allocation plan that redistributes the SM clusters.
In this embodiment, obtaining the performance imbalance of the GPU specifically includes: obtaining each co-running application's actual performance degradation from its predicted conflict performance degradation and scaling performance degradation, and obtaining the performance imbalance of the GPU from the actual performance degradations; the actual performance degradation equals the product of the corresponding conflict performance degradation and scaling performance degradation.
Specifically, in this embodiment, determining the new SM allocation plan according to the imbalance includes: if the imbalance exceeds a set threshold, reallocating; during reallocation, repeatedly moving one SM cluster, via a preset algorithm, from the application with the smallest current performance degradation to the application with the largest, gradually reducing the imbalance; and, when the distance between the new allocation plan and the initial allocation plan exceeds a specific threshold, determining the current new allocation plan as the new SM allocation plan.
The allocation-scheduling layer is responsible for predicting each application's actual performance degradation from the outputs of the conflict and scaling performance predictors. The actual performance degradation is the degradation of an application's performance, under L2-cache and memory-bandwidth contention and with its currently allocated number of SMs, relative to its performance when it monopolizes the whole GPU. From the definitions, the actual performance degradation equals the product of the corresponding conflict performance degradation and scaling performance degradation. On the basis of the predicted actual degradations, a heuristic greedy algorithm gradually adjusts the number of SMs allocated to each application to reduce the imbalance of degradation among the applications, where the imbalance is defined as the difference between the maximum and minimum performance degradation across the co-running applications. The dispatching algorithm (given as Table 1 in the original filing) works as follows.
The algorithm is invoked periodically. It first checks whether the imbalance under the current allocation is already below a specified threshold, and if so it terminates immediately. If the current imbalance exceeds the threshold, the algorithm greedily moves one SM at a time from the application with the smallest current performance degradation to the application with the largest, gradually reducing the imbalance. The distance between two allocations is defined as the maximum change in SM count over all applications. When the distance between the new allocation plan and the initial one exceeds a specific threshold, the algorithm terminates immediately, because the accuracy of the predictions depends on how far the allocation deviates from the original plan.
To sum up, when using the present invention, offline training is performed first; as shown in Fig. 4, the training flow is:
1) Design pressure probes for the target GPU architecture: design separate pressure-probe programs for the shared L2 cache and for global memory bandwidth, to quantify the severity of contention on each shared resource.
2) Design pressure generators for the target GPU architecture: design separate pressure generators for the shared L2 cache and for global memory bandwidth, to generate pressure of a particular magnitude on a specific shared resource.
3) Co-run the pressure probes with the pressure generators: let the various pressure probes and pressure generators share the GPU, and collect the corresponding runtime information and the generated pressure values.
4) Train the runtime pressure extractor: with the runtime information collected in the previous stage as input and the measured pressure values as output, train a neural network for online pressure extraction.
5) Collect a sufficiently representative set of applications: collect a group of applications that fully covers the mainstream scenarios of the target, forming the training set.
6) Co-run the applications with the pressure generators: let the applications and the pressure generators share the GPU, and collect the runtime statistics, each application's conflict performance degradation and scaling performance degradation, and the pressure generated by the generators.
7) Train the conflict performance degradation predictor: with the runtime information collected in the previous stage and the pressures generated by the generators as input, and the applications' conflict performance degradation as labels, train a neural network for online conflict-degradation prediction.
8) Train the scaling performance degradation predictor: with the runtime information collected in the previous stage and the pressures generated by the generators as input, and the applications' scaling performance degradation as labels, train a neural network for online scaling-degradation prediction.
After offline training is complete, online scheduling can be carried out. The online scheduling flow, as shown in Fig. 5, is:
1) Collect runtime information: collect the per-level cache runtime statistics of each co-running application and the current SM allocation plan.
2) Extract pressure: with the information collected in the previous step as input, extract the pressure each co-running application experiences on the L2 cache and on memory bandwidth.
3) Predict conflict degradation: with the collected runtime information of each co-running application and the pressures it experiences on the various shared resources as input, the conflict performance degradation predictor outputs the application's conflict performance degradation.
4) Predict scaling degradation: with the collected runtime information of each co-running application and the pressures it experiences on the various shared resources as input, the scaling performance degradation predictor outputs the application's scaling performance degradation.
5) Reallocate SMs: from the predicted conflict and scaling degradation, compute each co-running application's actual performance degradation, and from these compute the performance imbalance of the system. If the imbalance exceeds a specific threshold, reallocate: the greedy method gradually reduces the current imbalance, yielding an allocation plan better than the current one, which is then applied.
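Putting the five online steps together, a sketch of the periodic scheduling loop; all helper names and signatures are placeholders standing in for the trained components described above, and the threshold and period are assumptions:

```python
import time

def online_scheduling_loop(apps, collect_stats, extract_pressure,
                           predict_conflict, predict_scaling,
                           rebalance, alloc,
                           imbalance_threshold: float = 0.05,
                           period_s: float = 1.0):
    """One iteration performs the five online steps above."""
    while True:
        stats = {a: collect_stats(a) for a in apps}                 # step 1
        pressure = {a: extract_pressure(stats[a]) for a in apps}    # step 2
        degradation = {}
        for a in apps:
            conflict = predict_conflict(stats[a], pressure[a])      # step 3
            scaling = predict_scaling(stats[a], pressure[a])        # step 4
            degradation[a] = conflict * scaling  # actual = product of the two
        # Step 5: reallocate only when the imbalance exceeds the threshold.
        if max(degradation.values()) - min(degradation.values()) > imbalance_threshold:
            alloc = rebalance(alloc, degradation)
        time.sleep(period_s)
```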
An embodiment of the present invention further provides a storage medium storing program instructions; when a GPU processor runs the program instructions, the GPU performance balance scheduling method described above is implemented. The method has been described in detail above and is not repeated here.
An embodiment of the present invention further provides an electronic terminal, for example a server. The electronic terminal comprises a GPU processor and a memory; the memory stores program instructions, and the GPU processor runs the program instructions to implement the GPU performance balance scheduling method described above. The method has been described in detail above and is not repeated here.
In conclusion the present invention provides a set of balancing performance scheduling mechanisms that multitask GPU is shared towards preemptive type.It should Mechanism can be under the premise of hardware supported not be increased, it is ensured that promotes GPU resource utilization rate using spatial parallelism and system is whole On the basis of performance, it is further ensured that the equilibrium of hydraulic performance decline degree between sharing application, achievement of the invention can be used directly In the publicly-owned cloud environment of multi-tenant, to ensure the fairness of performance between each shared user.So the present invention effectively overcomes Various shortcoming of the prior art and have high industrial utilization.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (10)

1. A GPU performance balance scheduling method, characterized in that the method comprises:
collecting the per-level cache runtime statistics of each co-running application and the current streaming multiprocessor (SM) cluster allocation plan;
extracting, with a trained runtime pressure extractor, the pressure each co-running application experiences on the shared L2 cache and on memory bandwidth;
taking the collected runtime statistics of each co-running application and the pressures it experiences on the various shared resources as input, outputting the application's conflict performance degradation through a trained conflict performance degradation predictor, and outputting its scaling performance degradation through a trained scaling performance degradation predictor;
from the predicted conflict performance degradation and scaling performance degradation of each co-running application, obtaining the performance imbalance of the GPU and determining, according to the imbalance, a new SM allocation plan that redistributes the SM clusters.
2. The GPU performance balance scheduling method according to claim 1, characterized in that training the runtime pressure extractor comprises:
designing separate pressure-probe programs for the shared L2 cache and for memory bandwidth;
designing separate pressure generators for the shared L2 cache and for memory bandwidth;
letting the pressure probes and the pressure generators share the GPU, collecting the corresponding runtime statistics and measuring the pressure generated on the L2 cache and on memory bandwidth;
training a preset neural network with the collected runtime statistics as input and the measured pressure values as output, forming the runtime pressure extractor.
3. The GPU performance balance scheduling method according to claim 2, characterized in that training the conflict performance degradation predictor and training the scaling performance degradation predictor comprise:
selecting multiple application programs;
letting the applications and the pressure generators share the GPU, collecting the corresponding runtime statistics and measuring the pressure generated on the L2 cache and on memory bandwidth, each application's conflict performance degradation and each application's scaling performance degradation;
training a preset neural network with the collected runtime statistics and the measured pressure values as input and the applications' conflict performance degradation as output, forming the conflict performance degradation predictor;
training a preset neural network with the collected runtime statistics and the measured pressure values as input and the applications' scaling performance degradation as output, forming the scaling performance degradation predictor.
4. The GPU performance balance scheduling method according to claim 2, characterized in that training the preset neural network with the collected runtime statistics as input and the measured pressure values as output, forming the runtime pressure extractor, specifically comprises:
training a preset neural network with the collected runtime statistics as input and the measured L2-cache pressure values as output, forming an L2-cache pressure extractor;
training a preset neural network with the collected runtime statistics as input and the measured memory-bandwidth pressure values as output, forming a memory-bandwidth pressure extractor.
5. The GPU performance balance scheduling method according to claim 4, characterized in that the neural network used by the L2-cache pressure extractor and the memory-bandwidth pressure extractor comprises one input layer, two hidden layers and one output layer; wherein the number of neurons in each hidden layer equals the number of inputs, and the activation function used by the neural network is the LeakyReLU function.
6. The GPU performance balance scheduling method according to claim 1 or 3, characterized in that the conflict performance degradation is, at a fixed number of SM clusters, the degradation of an application's performance under L2-cache and memory-bandwidth contention relative to its performance without contention; and the scaling performance degradation is, in the complete absence of shared-cache and memory-bandwidth contention, the degradation of an application's performance when using a given number of SM clusters relative to its performance when monopolizing the whole GPU.
7. The GPU performance balance scheduling method according to claim 1, characterized in that obtaining the performance imbalance of the GPU specifically comprises:
obtaining each co-running application's actual performance degradation from its predicted conflict performance degradation and scaling performance degradation, and obtaining the performance imbalance of the GPU from the actual performance degradations; wherein the actual performance degradation equals the product of the corresponding conflict performance degradation and scaling performance degradation.
8. The GPU performance balance scheduling method according to claim 1 or 7, characterized in that determining the new SM allocation plan according to the imbalance specifically comprises:
if the imbalance exceeds a set threshold, reallocating; during reallocation, moving, via a preset algorithm, one SM cluster at a time from the application with the smallest current performance degradation to the application with the largest, gradually reducing the imbalance; and, when the distance between the new allocation plan and the initial allocation plan exceeds a specific threshold, determining the current new allocation plan as the new SM allocation plan.
9. A storage medium storing program instructions, characterized in that, when a GPU processor runs the program instructions, the method according to any one of claims 1 to 8 is implemented.
10. An electronic terminal, comprising a GPU processor and a memory, characterized in that the memory stores program instructions and the GPU processor runs the program instructions to implement the method according to any one of claims 1 to 8.
CN201711460215.1A 2017-12-28 2017-12-28 GPU performance balance scheduling method, storage medium and electronic terminal Active CN108228351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711460215.1A CN108228351B (en) 2017-12-28 2017-12-28 GPU performance balance scheduling method, storage medium and electronic terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711460215.1A CN108228351B (en) 2017-12-28 2017-12-28 GPU performance balance scheduling method, storage medium and electronic terminal

Publications (2)

Publication Number Publication Date
CN108228351A true CN108228351A (en) 2018-06-29
CN108228351B CN108228351B (en) 2021-07-27

Family

ID=62646577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711460215.1A Active CN108228351B (en) 2017-12-28 2017-12-28 GPU performance balance scheduling method, storage medium and electronic terminal

Country Status (1)

Country Link
CN (1) CN108228351B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020056620A1 (en) * 2018-09-19 2020-03-26 Intel Corporation Hybrid virtual gpu co-scheduling
CN110929627A (en) * 2019-11-18 2020-03-27 北京大学 Image recognition method of efficient GPU training model based on wide-model sparse data set
CN117762654A (en) * 2023-12-22 2024-03-26 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment and storage medium for collecting GPU information by application program

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461928A (en) * 2013-09-16 2015-03-25 华为技术有限公司 Method and device for dividing caches
CN105487927A (en) * 2014-09-15 2016-04-13 华为技术有限公司 Resource management method and device
CN106383792A (en) * 2016-09-20 2017-02-08 北京工业大学 Missing perception-based heterogeneous multi-core cache replacement method
US20170352120A1 (en) * 2007-07-13 2017-12-07 Cerner Innovation, Inc. Claim processing validation system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170352120A1 (en) * 2007-07-13 2017-12-07 Cerner Innovation, Inc. Claim processing validation system
CN104461928A (en) * 2013-09-16 2015-03-25 华为技术有限公司 Method and device for dividing caches
CN105487927A (en) * 2014-09-15 2016-04-13 华为技术有限公司 Resource management method and device
CN106383792A (en) * 2016-09-20 2017-02-08 北京工业大学 Missing perception-based heterogeneous multi-core cache replacement method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Rodrigo Escobar: "Performance prediction of parallel applications based on small-scale executions", IEEE *
Siqi Wang: "CGPredict: Embedded GPU performance estimation from single-threaded application", ACM *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020056620A1 (en) * 2018-09-19 2020-03-26 Intel Corporation Hybrid virtual gpu co-scheduling
US11900157B2 (en) 2018-09-19 2024-02-13 Intel Corporation Hybrid virtual GPU co-scheduling
CN110929627A (en) * 2019-11-18 2020-03-27 北京大学 Image recognition method of efficient GPU training model based on wide-model sparse data set
CN110929627B (en) * 2019-11-18 2021-12-28 北京大学 Image recognition method of efficient GPU training model based on wide-model sparse data set
CN117762654A (en) * 2023-12-22 2024-03-26 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment and storage medium for collecting GPU information by application program

Also Published As

Publication number Publication date
CN108228351B (en) 2021-07-27

Similar Documents

Publication Publication Date Title
Hung et al. Wide-area analytics with multiple resources
CN106776005A (en) A kind of resource management system and method towards containerization application
Peddi et al. An intelligent cloud-based data processing broker for mobile e-health multimedia applications
KR20210082210A (en) Creating an Integrated Circuit Floor Plan Using Neural Networks
CN106445681B (en) Distributed task dispatching system and method
CN111274036B (en) Scheduling method of deep learning task based on speed prediction
CN108228351A (en) Balancing performance dispatching method, storage medium and the electric terminal of GPU
CN103955398B (en) Virtual machine coexisting scheduling method based on processor performance monitoring
CN108664378A (en) A kind of most short optimization method for executing the time of micro services
CN104243617B (en) Towards the method for scheduling task and system of mixed load in a kind of isomeric group
Basireddy et al. AdaMD: Adaptive mapping and DVFS for energy-efficient heterogeneous multicores
CN107861606A (en) A kind of heterogeneous polynuclear power cap method by coordinating DVFS and duty mapping
CN110009233B (en) Game theory-based task allocation method in crowd sensing
CN108845874A (en) The dynamic allocation method and server of resource
CN106354729A (en) Graph data handling method, device and system
CN107657599A (en) Remote sensing image fusion system in parallel implementation method based on combination grain division and dynamic load balance
CN107315889A (en) The performance test methods and storage medium of simulation engine
CN109697637A (en) Object type determines method, apparatus, electronic equipment and computer storage medium
CN111860867B (en) Model training method and system for hybrid heterogeneous system and related device
CN110347602A (en) Multitask script execution and device, electronic equipment and readable storage medium storing program for executing
CN108900343A (en) Local storage-based resource prediction and scheduling method for cloud server
CN106888156A (en) A kind of method and device for playing reward distribution
CN104123119B (en) Dynamic vision measurement feature point center quick positioning method based on GPU
Rak Performance modeling using queueing Petri nets
CN113190342B (en) Method and system architecture for multi-application fine-grained offloading of cloud-edge collaborative networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant