CN106293003A

CN106293003A - Dynamic power consumption optimization method for heterogeneous systems based on AOV-network critical-path search

Info

Publication number: CN106293003A
Application number: CN201610638736.0A
Authority: CN
Inventors: 王卓薇; 程良伦
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2016-08-05
Filing date: 2016-08-05
Publication date: 2017-01-04

Abstract

The invention discloses a dynamic power consumption optimization method for heterogeneous systems based on AOV-network critical-path search. The execution of a CUDA program on a heterogeneous system is described as an abstract data representation, the AOV (activity-on-vertex) network. The critical path of the program is solved on the AOV network, the non-critical tasks that can be slowed down by DVFS without affecting the total execution time of the program are identified, and the frequency adjustment range of each non-critical task is solved under an energy-optimal objective. The method effectively converts the energy optimization problem of a CUDA program into a mathematical programming problem on the AOV network, and thus provides an energy-optimal strategy under a performance constraint.

Description

Dynamic power consumption optimization method for heterogeneous systems based on AOV-network critical-path search

Technical field

The present invention relates to the field of low-power heterogeneous systems, and more specifically to the dynamic power consumption optimization problem of CPU-GPU heterogeneous systems.

Background art

Among existing methods for reducing the dynamic power consumption of heterogeneous systems, most GPU low-power optimizations target the power consumption of a single GPU task; few studies optimize the power consumption of a whole application on a CPU-GPU heterogeneous system according to its characteristics. An application, however, contains tasks of several different types. Moreover, in the CUDA programming environment the host CPU sits idle after calling cudaThreadSynchronize(), which wastes computing resources. Although the GPU has very strong computing power, processing a large data set is still time-consuming. Part of the computation can be assigned to the idle CPU so that the CPU and GPU work concurrently, which reduces the execution time of the kernel function and hence of the whole program. Therefore, for a given task (a loop whose iterations have no dependences), the design goal of this solution is to partition the task between the CPU and GPU so that the heterogeneous system achieves optimal energy (or performance) while satisfying a performance (or energy) constraint.

Summary of the invention

The present invention proposes a dynamic power consumption optimization method for heterogeneous systems based on AOV-network critical-path search. The method finds a set of non-critical tasks in the program (tasks that do not affect the execution time of the whole program while it runs) and determines the corresponding CPU or GPU frequency scaling factors, so that when the program runs on a CPU-GPU heterogeneous system its execution time is unchanged and its energy consumption is optimal.

To solve the above problem, the technical solution of the present invention is as follows:

A dynamic power consumption optimization method for heterogeneous systems based on AOV-network critical-path search: analyze the running characteristics of a CUDA application on a CPU-GPU heterogeneous system and summarize the task dependences in it; represent the execution of a complete program containing multiple GPU tasks as a graph-based data structure, the AOV network; on this basis analyze the critical path of the program, identify the parts of the program whose energy consumption can be optimized, and solve the corresponding frequency adjustment ranges, so that the overall power consumption of the program is minimized while its performance is kept unchanged. The object of optimization is the dynamic power consumption of the whole program containing multiple GPU tasks. The details are as follows:

Step1: First, separate the different types of tasks in the program (CPU computation tasks, communication tasks, GPU computation tasks), then further partition each single GPU task so that it executes on the CPU and GPU simultaneously. Describe the running process as the dependences between tasks and construct the AOV network of the program's execution.

Step2: Next, analyze the AOV network to determine the critical path (the directed graph formed by the critical tasks). The CPU and GPU tasks on non-critical paths (the directed graph formed by the non-critical tasks) are the non-critical tasks whose frequency can be adjusted to save power.

Step3: Finally, determine from the execution times of the critical tasks the range by which the execution time of each non-critical task can be relaxed, and solve the processor frequency adjustment range of each task so as to minimize power consumption.

A typical CUDA program is shown in Fig. 1. Here block1, block2 and block4 denote CPU tasks unrelated to the GPU task. The single GPU task is divided into N subtasks (2 subtasks in the example that follows), each corresponding to a sub-kernel function; at the end of each sub-kernel function the host must call cudaThreadSynchronize() to synchronize. block3_1 to block3_N denote the part of each subtask that is handed to the CPU for processing. The specific procedure is:

Step1: Analyze the task dependences in the program and build the program task dependency graph. To simplify the problem without loss of generality, assume that the single GPU task is divided into 2 subtasks and that each subtask is further partitioned between the CPU and GPU. The resulting program task dependency graph G = (V, E), where V is the set of task nodes and E the set of task dependences, is shown in Fig. 2. In the figure, C denotes a CPU computation task, T a data transfer between the CPU and GPU, and G a GPU computation task.

Step2: Construct the AOV network. Tasks that execute concurrently may contend for resources, which is resolved by an arbitration mechanism: if at some moment several tasks in the task streams satisfy their precedence dependences but contend for the same resource, the task with the smaller task-stream number is executed first, and a dependence is added between the contending tasks. For example, in Fig. 2, once CPU task C1 completes, the tasks whose dependences are satisfied are T1, C2 and T2. T1 and T2 are both communication tasks, so they contend for a resource and cannot execute at the same time. T1 is executed first according to the arbitration mechanism, so a task dependence from T1 to T2 is added to the original graph. The program task dependency graph is thereby extended to Fig. 3.
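The arbitration rule can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `add_contention_edges`, the resource labels `"bus"` and `"cpu"`, the task-stream numbers, and the edges leaving C1 are all assumptions chosen to mirror the T1/T2 example.

```python
from collections import defaultdict

def add_contention_edges(edges, ready, resource_of, stream_of):
    """Return the edge set extended with arbitration edges.

    Among the tasks in `ready` that need the same resource, the task with the
    smaller task-stream number runs first, so an edge is added from each
    contending task to the next one in stream order.
    """
    by_resource = defaultdict(list)
    for task in ready:
        by_resource[resource_of[task]].append(task)

    extended = set(edges)
    for tasks in by_resource.values():
        ordered = sorted(tasks, key=lambda t: stream_of[t])
        for earlier, later in zip(ordered, ordered[1:]):
            extended.add((earlier, later))
    return extended

# Hypothetical data: after C1 finishes, T1, C2 and T2 are all ready.
edges = {("C1", "T1"), ("C1", "C2"), ("C1", "T2")}
resource_of = {"T1": "bus", "T2": "bus", "C2": "cpu"}  # assumed resource labels
stream_of = {"T1": 1, "C2": 1, "T2": 2}                # assumed stream numbers

extended = add_contention_edges(edges, ["T1", "C2", "T2"], resource_of, stream_of)
new_edges = sorted(extended - edges)
print(new_edges)
```

Only the two communication tasks share a resource in this toy input, so the single added edge runs from T1 to T2; C2 is left untouched because nothing else is ready on the CPU.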

Step3: Determine the earliest start time and latest start time of each task in the program. Let EST(M_i) denote the earliest start time of task M_i and LST(M_i) its latest start time; <v_j, v_i> indicates that node j must execute before node i.

Earliest start time:

\begin{cases} EST(M_1) = 0 \\ EST(M_i) = \max_{\langle v_j, v_i \rangle \in E} \{ EST(M_j) + Time(M_j) \} \end{cases} \qquad (1)

Latest start time:

\begin{cases} LST(M_N) = EST(M_N) \\ LST(M_i) = \min_{\langle v_i, v_j \rangle \in E} \{ LST(M_j) - Time(M_i) \} \end{cases} \qquad (2)

Therefore the earliest and latest start times EST(M_i) and LST(M_i) of task M_i can be obtained recursively from formulas (1) and (2).
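The two recursions can be sketched in Python as a forward and a backward pass over a topological order. This is an illustrative sketch under stated assumptions, not the patent's code: the function name `est_lst`, the four-task graph and its execution times are hypothetical, and the graph is assumed to have a single entry task and a single exit task, as formulas (1) and (2) imply.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def est_lst(edges, time):
    """Compute EST and LST per formulas (1) and (2).

    edges: iterable of (j, i) pairs meaning task j must run before task i.
    time:  dict mapping each task to its execution time.
    """
    preds = {t: set() for t in time}
    succs = {t: set() for t in time}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    # static_order() yields every task after all of its predecessors.
    order = list(TopologicalSorter(preds).static_order())

    est = {}
    for t in order:  # forward pass, formula (1); the entry task gets 0
        est[t] = max((est[p] + time[p] for p in preds[t]), default=0)

    lst = {}
    for t in reversed(order):  # backward pass, formula (2); exit task gets EST
        lst[t] = min((lst[s] - time[t] for s in succs[t]), default=est[t])
    return est, lst

# Hypothetical four-task network: A -> B -> D and A -> C -> D.
time = {"A": 2, "B": 3, "C": 1, "D": 1}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
est, lst = est_lst(edges, time)
print(est)
print(lst)
```

In this toy network only C ends up with a latest start time later than its earliest start time, which is the property the next step uses to separate critical from non-critical tasks.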

Step4: Determine the critical path. If a task's earliest possible start time equals its latest allowed start time, the task lies on the critical path of the AOV network; its run time directly affects the run time of the whole program and cannot be relaxed. Conversely, if a task's earliest possible start time is less than its latest allowed start time, the task lies on a non-critical path of the AOV network; its frequency can be adjusted so that its run time is suitably increased and the dynamic energy consumption of the system is reduced.
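As a small illustration of this test, the following Python sketch splits tasks by comparing the two start times. The function name `split_by_slack` and the EST/LST values are hypothetical; they are not the values of Fig. 5.

```python
def split_by_slack(est, lst):
    """Separate critical tasks (EST == LST) from non-critical ones (EST < LST).

    Returns the list of critical tasks and a dict giving, for each
    non-critical task, how long its start may be delayed.
    """
    critical = [t for t in est if lst[t] == est[t]]
    slack = {t: lst[t] - est[t] for t in est if lst[t] > est[t]}
    return critical, slack

# Hypothetical start times for a four-task network.
est = {"A": 0, "B": 2, "C": 2, "D": 5}
lst = {"A": 0, "B": 2, "C": 4, "D": 5}

critical, slack = split_by_slack(est, lst)
print(critical)  # tasks whose run time cannot be relaxed
print(slack)     # tasks that are candidates for frequency reduction
```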

Step5: Convert the dynamic energy optimization problem into an N-variable extremum problem. For a non-critical node set P_i, construct an AOV subnet. Suppose the subnet has N non-critical CPU and GPU task nodes, denoted M_j^i, whose original processor and memory frequencies are f_c(M_j^i) and f_m(M_j^i), 1 ≤ j ≤ N, and whose frequencies after adjustment are f'_c(M_j^i) and f'_m(M_j^i). The computation time of each task then becomes T_comp(M_j^i) · f_c(M_j^i) / f'_c(M_j^i) and its memory access time becomes T_mem(M_j^i) · f_m(M_j^i) / f'_m(M_j^i); the times of the other tasks are unchanged. The earliest possible start time after adjustment of the task M_{i+1}^c that appears in the constraints can be computed according to formula (1) and is denoted EST'(M_{i+1}^c).

When the computation time can be relaxed, the computation time T'_comp is increased within a certain range and the processor frequency is reduced; minimizing the system dynamic energy consumption then reduces to an N-variable extremum problem:

\begin{cases} \min \sum_{j=1}^{N} k_c \cdot f'_c(M_j^i)^3 \cdot \frac{N_{warp} \cdot N_B}{N} \cdot \frac{T_{comp}(M_j^i) \cdot f_c(M_j^i)}{f'_c(M_j^i)} + k_m \cdot f_m^3 \cdot \frac{N_{warp} \cdot N_B}{N} \cdot T_{mem} \\ \text{s.t. } EST'(M_{i+1}^c) = LST(M_{i+1}^c) \\ T'_{comp} \leq T_{mem} \\ T_{lat} + \frac{T'_{comp}}{N_{mem}} \leq Num\_Warp \cdot \frac{T_{mem}}{N_{mem}} \\ f'_c \leq f_{c(\max)} \end{cases} \qquad (3)

When the memory access time can be relaxed, the memory access time T'_mem is increased within a certain range and the memory access frequency is reduced; minimizing the system dynamic energy consumption then reduces to an N-variable extremum problem:

\begin{cases} \min \sum_{j=1}^{N} k_c \cdot f_c^3 \cdot \frac{N_{warp} \cdot N_B}{N} \cdot T_{comp} + k_m \cdot f'_m(M_j^i)^3 \cdot \frac{N_{warp} \cdot N_B}{N} \cdot \frac{T_{mem}(M_j^i) \cdot f_m(M_j^i)}{f'_m(M_j^i)} \\ \text{s.t. } EST'(M_{i+1}^c) = LST(M_{i+1}^c) \\ T'_{mem} \leq T_{comp} \\ T_{lat} + \frac{T_{comp}}{N_{mem}} \geq Num\_Warp \cdot \frac{T'_{mem}}{N_{mem}} \\ f'_m \leq f_{m(\max)} \end{cases} \qquad (4)
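The effect of one term of these objectives can be checked with a short Python sketch. It assumes, as the objective terms do, that dynamic power scales with the cube of the frequency and that execution time scales inversely with it. The function name `dynamic_energy`, the parameter `weight` (standing in for the N_warp · N_B / N factor) and all numeric values are illustrative assumptions.

```python
def dynamic_energy(k, f_orig, f_new, t_orig, weight=1.0):
    """Energy of one task after its frequency is changed from f_orig to f_new.

    Models a single term k * f_new**3 * weight * (t_orig * f_orig / f_new):
    power grows with the cube of the frequency, and the execution time
    stretches in inverse proportion to it.
    """
    t_new = t_orig * f_orig / f_new
    energy = weight * k * f_new ** 3 * t_new
    return energy, t_new

# Hypothetical task: original time 4 at normalized frequency 1.0.
baseline, _ = dynamic_energy(k=1.0, f_orig=1.0, f_new=1.0, t_orig=4.0)
scaled, t_new = dynamic_energy(k=1.0, f_orig=1.0, f_new=0.25, t_orig=4.0)
ratio = scaled / baseline
print(t_new, ratio)
```

Under this model the energy of a task falls with the square of the frequency ratio, so running at a quarter of the original frequency stretches the time fourfold and cuts the energy to one sixteenth.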

Brief description of the drawings

Fig. 1 is a schematic diagram of a typical CUDA program according to the present invention.

Fig. 2 is the program task dependency graph of the present invention.

Fig. 3 is a schematic diagram of the program AOV network of the present invention.

Fig. 4 is the case analysis diagram of the present invention.

Fig. 5 shows the EST and LST of each task in the present invention.

Detailed description of the invention

The present invention is further described below with reference to the accompanying drawings and an embodiment.

This embodiment describes in detail an implementation of the dynamic energy optimization method based on AOV-network critical-path search. Fig. 4 shows the analysis process for the example program. Figure (a) is the original algorithm flow, which has 6 steps. First, the two arrays a and b are initialized; then procedures f1 and f2 are called to process a and b respectively, yielding arrays c and d. In the fourth line, function f3 computes array e from a scalar α; in the fifth line, function f4 computes the scalar result β from the two arrays a and b; in the last line, function f5 computes array g from β, c, d and e.

Figure (b) gives a CUDA implementation of this algorithm. Functions f1, f2, f3 and f5 are assumed to be parallelizable and are therefore implemented as kernel functions, corresponding to Kernel1-Kernel4 respectively; f4 cannot be parallelized and is still executed by the CPU. Before Kernel1 and Kernel2 execute, cudaMemcpy must be called to load the input arrays a and b into GPU memory, and after Kernel4 finishes, array g must be copied back to CPU memory. To exploit the concurrency between kernel computation and communication and to hide communication overhead, the CUDA implementation sets Kernel1, Kernel2, Kernel3 and their corresponding communication operations to asynchronous mode. Kernel4 uses the outputs of the first 3 kernel functions, so a global task-stream synchronization is performed before Kernel4 is called.

Figure (c) gives the constructed task dependency graph. Assuming the execution time of each task is as listed in figure (d), and adding the resource dependence edges T1 to T2, G3 to G1 and G1 to G2, the generated AOV network is shown in figure (e). Formulas (1) and (2) then yield the earliest possible start time and the latest allowed start time of each node, as shown in Fig. 5. For tasks C1, T1, T2, G2, S1, G4 and T3 the earliest possible start time equals the latest allowed start time; they form the critical path of the AOV network, shown as the shaded nodes in figure (f). The non-critical tasks whose frequency can be adjusted are therefore G1, G3 and C2. These 3 tasks can be divided into two groups that are not reachable from each other, shown by the dashed boxes in figure (f), and the power consumption of each group can be adjusted independently.

The case of C2 is relatively simple. Its earliest possible and latest allowed start times are 1 and 13 respectively, so by the first constraint in formulas (3) and (4) it suffices that adjusting the CPU frequency for C2 does not affect the earliest possible start time of S1. According to formula (1), the execution time of C2 can be extended to 16, so the CPU frequency while executing C2 can be reduced to as low as 1/4 of the original, and the energy consumption of C2 falls to 1/16 of the original. For G1 and G3, likewise, adjusting the GPU frequency must not affect the earliest possible start time of S1. Since the initial run times of G1 and G3 are both 2, assume without loss of generality that each consumes energy E at the original frequency and that their run times after adjustment become t'_{G_1} and t'_{G_3}. According to formula (1), the energy-optimal run times of G1 and G3 are given by

\begin{cases} \min \left\{ \left( \frac{2}{t'_{G_1}} \right)^2 E + \left( \frac{2}{t'_{G_3}} \right)^2 E \right\} \\ \text{s.t. } \max \left\{ 1 + t'_{G_3},\; EST'(G_1) + t'_{G_1},\; EST'(G_2) + 10 \right\} = 17 \end{cases}

where

EST'(G_1) = \max \left\{ 1 + t'_{G_3},\; 4 \right\},

EST'(G_2) = \max \left\{ EST'(G_1) + t'_{G_1},\; 7 \right\}

Solving the above gives t'_{G_1} = t'_{G_3} = 3, at which point the total energy consumption of the two tasks reaches its minimum of 8/9 E.
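The stated minimum can be checked numerically with a brute-force Python sketch over the constraint set of this example. The grid search, its step of 1/4, and the search range of 2 to 8 (run times are assumed never to drop below the original value of 2) are illustrative choices and are not part of the patent; the function names are hypothetical.

```python
from fractions import Fraction

def feasible(t_g1, t_g3):
    """Constraint of the example: the finishing bound must stay at 17."""
    est_g1 = max(1 + t_g3, 4)
    est_g2 = max(est_g1 + t_g1, 7)
    return max(1 + t_g3, est_g1 + t_g1, est_g2 + 10) == 17

def energy_in_units_of_e(t_g1, t_g3):
    """Objective of the example: (2/t_G1)^2 E + (2/t_G3)^2 E, with E = 1."""
    return (2 / t_g1) ** 2 + (2 / t_g3) ** 2

# Exact rational grid from 2 to 8 in steps of 1/4 (assumed search range).
step = Fraction(1, 4)
grid = [2 + k * step for k in range(25)]

best = min(
    (energy_in_units_of_e(t1, t3), t1, t3)
    for t1 in grid
    for t3 in grid
    if feasible(t1, t3)
)
print(best)  # (minimum energy in units of E, run time of G1, run time of G3)
```

Exact fractions are used instead of floats so that the equality in the constraint and the comparison of energies are not affected by rounding.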

A specific embodiment of the present invention has been described above. It should be understood that the invention is not limited to this particular embodiment; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A dynamic power consumption optimization method for heterogeneous systems based on AOV-network critical-path search, which analyzes the running characteristics of a CUDA application on a CPU-GPU heterogeneous system, summarizes the task dependences in it, represents the execution of a complete program containing multiple GPU tasks as a graph-based data structure, the AOV network, analyzes on this basis the critical path of the program, identifies the parts of the program whose energy consumption can be optimized, and solves the corresponding frequency adjustment ranges, so that the overall power consumption of the program is minimized while its performance is kept unchanged, the object of optimization being the dynamic power consumption of the whole program containing multiple GPU tasks, specifically as follows:

Step1: first, separate the different types of tasks in the program, namely CPU computation tasks, communication tasks and GPU computation tasks, then further partition each single GPU task so that it executes on the CPU and GPU simultaneously, describe the running process as the dependences between tasks, and construct the AOV network of the program's execution;

Step2: next, analyze the AOV network to determine the critical path, the CPU and GPU tasks on non-critical paths being the non-critical tasks whose frequency can be adjusted to save power.

2. The heterogeneous system dynamic power consumption optimization method according to claim 1, characterized in that Step1 specifically comprises: analyzing the task dependences in the program and building the program task dependency graph.

3. The heterogeneous system dynamic power consumption optimization method according to claim 2, characterized in that Step2 specifically comprises: constructing the AOV network, wherein tasks executing concurrently contend for resources and an arbitration mechanism is followed: if at some moment several tasks in the task streams satisfy their precedence dependences but contend for a resource, the task with the smaller task-stream number is executed first.

4. The heterogeneous system dynamic power consumption optimization method according to claim 1, characterized in that Step3 comprises the following step: determining the earliest start time and latest start time of each task in the program; EST(M_i) denotes the earliest start time of task M_i, LST(M_i) denotes the latest start time of task M_i, and <v_j, v_i> indicates that node j must execute before node i.

Earliest start time:

\begin{cases} EST(M_1) = 0 \\ EST(M_i) = \max_{\langle v_j, v_i \rangle \in E} \{ EST(M_j) + Time(M_j) \} \end{cases} \qquad (1)

Latest start time:

\begin{cases} LST(M_N) = EST(M_N) \\ LST(M_i) = \min_{\langle v_i, v_j \rangle \in E} \{ LST(M_j) - Time(M_i) \} \end{cases} \qquad (2)

Therefore the earliest and latest start times EST(M_i) and LST(M_i) of task M_i can be obtained recursively from formulas (1) and (2).