CN101901042A - Method for reducing power consumption based on dynamic task migrating technology in multi-GPU (Graphic Processing Unit) system


Info

Publication number
CN101901042A
CN101901042A
Authority
CN
China
Prior art keywords
gpu
power consumption
task
utilization factor
reducing power
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010102641204A
Other languages
Chinese (zh)
Other versions
CN101901042B (en)
Inventor
过敏意
马曦
朱寅
郑龙
沈耀
周憬宇
曹朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN2010102641204A priority Critical patent/CN101901042B/en
Publication of CN101901042A publication Critical patent/CN101901042A/en
Application granted granted Critical
Publication of CN101901042B publication Critical patent/CN101901042B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Power Sources (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to a method for reducing power consumption based on dynamic task migration technology in a multi-GPU (Graphic Processing Unit) system, in the technical field of computers, which comprises the following steps: installing a GPU utilization monitor on each GPU to obtain the average utilization of each GPU over a time T; when the utilization of the i-th GPU is R1, migrating all tasks on the i-th GPU to a GPU whose utilization is R2 and shutting down the i-th GPU; when the utilization of the j-th GPU is 100 percent, migrating part of the tasks on the j-th GPU to a GPU whose utilization is R3; when the utilization of all running GPUs exceeds a threshold R4 and the system has a shut-down GPU, automatically starting a shut-down GPU and allocating new computing tasks to the just-started GPU; and repeating these steps until all GPUs are used to run the program. The invention provides real-time resource-utilization monitoring, effectively reduces the power consumption of the GPUs, and optimizes the communication among the GPUs.

Description

Method for reducing power consumption based on dynamic task migration technology in a multi-GPU system
Technical field
The present invention relates to a method in the field of computer technology, and more specifically to a method for reducing power consumption based on dynamic task migration technology in a multi-GPU (Graphic Processing Unit, graphics processing unit) system.
Background technology
In recent years the GPU has developed rapidly; it is very well suited to carrying out large-scale, high-performance parallel numerical computation efficiently and at low cost. The GPU is a concept derived from the CPU (central processing unit). It is the key component of the graphics card and, together with dedicated on-board memory or shared CPU memory, forms a subsystem that has become the key to the graphics performance of a PC. The growing number of graphics applications makes the GPU ever more important in the modern computer; after decades in which the CPU dominated PC performance, this chip has emerged rapidly in recent years and in some applications has even reached a status on a par with the CPU. NVIDIA first proposed the concept of the GPU when it released the GeForce 256 graphics card in 1999; this card reduced the dependence on the CPU and took over part of the CPU's work, particularly in 3D image processing. The core technologies adopted by the GPU include hardware T&L (Transform and Lighting), cube environment texture mapping and vertex blending, texture compression and bump mapping, and a 256-bit rendering engine with dual texture pixels, and their appearance greatly improved the graphics-processing performance of the machine. A GPGPU (General-Purpose computing on Graphics Processing Units) is a specialized graphics processor that can take on general-purpose computing tasks originally handled by the central processing unit. In the full sense, a GPGPU can not only perform graphics processing but can also complete the computing work of a CPU, so it is better suited to high-performance computing, supports higher-level programming languages, and is more powerful in performance and versatility. In the narrow sense, a GPGPU is simply a GPU with enhanced functions, so the advantages of the GPU are naturally also the advantages of the GPGPU, and it makes up for the serious inadequacy of the CPU in floating-point computing power. Compared with the GPU, the biggest weakness of the CPU is its insufficient floating-point capability: whether for Intel's or AMD's CPU products, current floating-point performance is mostly below tens of Gflops (billions of floating-point operations per second), while the floating-point performance of the GPU was already several times that of mainstream processors by 2006.
The Tesla C2050 released by NVIDIA in 2010 has 448 processing cores, a memory bandwidth of 144 GB/s and a power consumption of 247 W, and its double-precision and single-precision floating-point performance reach 515 Gflops and 1 Tflops (one trillion floating-point operations per second) respectively; in floating-point computation the GPU clearly offers high performance that the CPU cannot replace. In addition, the powerful processing capability and enormous memory bandwidth of the GPU can be used effectively for graphics-rendering computation, and GPUs are widely applied in fields such as image processing, video transmission, signal processing, artificial intelligence, pattern recognition, financial analysis, numerical computation, petroleum exploration, astronomical calculation, fluid mechanics, biological computation, molecular dynamics calculation, database management, and encryption, where speed-ups of one to two orders of magnitude over the CPU have been obtained, with striking results.
Yet even if GPGPU processing performance exceeds that of an ordinary CPU by one to two orders of magnitude, this is still hard pressed to satisfy the requirements for high-performance computing in large-scale application systems. Parallel computing can, to a certain extent, solve those problems whose running time can be reduced by large amounts of parallel computation. Current parallel systems are mainly realized through distributed systems, computer clusters, multi-core processors and GPUs. In distributed systems and clusters, programs are mainly developed with the MPI library; on multi-core processors, multithreaded programs are mainly developed with OpenMP and POSIX threads on Linux; for GPU development there are Microsoft's HLSL, OpenGL's GLSL, and Stanford University's RTSL, and on NVIDIA's newest "Fermi" GPU architecture developers use the CUDA programming environment, in which the parallelism of an application can be realized whether one chooses the C language, C++, OpenCL, DirectCompute or Fortran, and the NVIDIA Parallel Nsight tool can be used as well. On a single machine, GPUs turn an ordinary desktop computer into a personal supercomputer: the NVIDIA Tesla personal supercomputer, for example, has nearly 960 parallel processing cores and 1 Tflops of floating-point computing power based on the revolutionary NVIDIA CUDA parallel computing architecture, equivalent to the computing capability of a data-center cluster system, and is therefore faster and more energy-efficient.
Given the different characteristics of CPUs and GPUs, the current mainstream architecture takes the CPU, which runs the operating system and database systems, as the core and the GPU, which handles large-scale parallel computation, as a coprocessor. Multi-GPU systems are the inevitable trend for satisfying users' future high-performance demands; here a multi-GPU system means multiple GPUs on the motherboard inside one chassis. There is, however, a technical problem: although GPU performance is excellent, it places higher requirements on the user's programs, which must conform to the programming models of different GPU architectures and must be written with different degrees of parallelism for different GPUs.
Shane Ryoo, Christopher I. Rodrigues and others of the University of Illinois at Urbana-Champaign pointed out in 2008, in the paper "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA" presented at the top parallel-computing conference PPoPP (Principles and Practice of Parallel Programming), that on the GeForce 8800GTX graphics card different applications need more than 3000 threads executing in parallel at the same time in order to hide the bandwidth bottleneck between main memory and video memory and the latency of GPU reads from global video memory. The GeForce 8800GTX card has 16 stream multiprocessors with 8 cores each, i.e. 128 physical cores; the Tesla C2050 card has 448 processing cores and a power consumption of 247 W, and to fully utilize these cores and reach or approach its theoretical double-precision peak of 515 Gflops, more than 3000 threads must execute in parallel at the same time. Therefore, for the GPU clusters built from multi-GPU hosts that are appearing on the high-end market, for example NVIDIA's Tesla S2050 1U computing system with four GPUs on the motherboard, a theoretical double-precision peak of 2.0 Tflops and a power consumption of 900 W, how to use resources efficiently and save energy is one of the problems that GPU manufacturers and application developers care about most.
Existing task-allocation technologies for multi-GPU systems come mainly from three companies: NVIDIA, AMD and the Israeli company Lucid. NVIDIA uses SLI, i.e. Scalable Link Interface technology (main reference: http://www.slizone.com/page/slizone_learn.html); this technology can only interconnect graphics cards of the same model. AMD uses CrossFire technology (main reference: http://game.amd.com/us-en/crossfirex_about.aspx), which can interconnect ATI cards of different series, so users do not have to discard their original card when upgrading, avoiding wasted resources. Lucid mainly uses the HYDRA Engine technology; the HYDRA Engine is an arbiter for the GPUs, responsible for distributing tasks among all the computing units, and its main feature is that it can not only make cards of the same brand but different models work together, but can also run cards of different brands simultaneously, giving it very strong compatibility (main reference: http://www.lucid-tech.com/). These three technologies approach the problem from the angle of task distribution: according to the load of the application they quickly evaluate the computing resources and allocate them accurately, avoiding waste. However, they do not consider the situation in which the computing resources far exceed the computing demand, for example when the number of threads that can run concurrently is limited and only one or a few GPUs are needed to satisfy the computation; in that case many GPU computing units sit idle, causing unnecessary energy costs.
Summary of the invention
The objective of the present invention is to overcome the above-mentioned deficiencies of the prior art by providing a method for reducing power consumption based on dynamic task migration technology in a multi-GPU system. By migrating the tasks on a GPU to other GPUs and shutting the original GPU down, the present invention markedly improves GPU utilization and achieves the beneficial effect of saving power.
The present invention is achieved by the following technical solution and comprises the following steps:
First step: a GPU utilization monitor is installed on each GPU to count the number of operations N1 executed by all the SPs (stream processing units) on that GPU during time T, and the average utilization μ = N1/N2 of each GPU over time T is obtained, where N2 is the theoretical peak calculation amount of that GPU during time T.
Second step: when the utilization of the i-th GPU is R1, all the tasks on the i-th GPU are migrated to a GPU whose utilization is R2, and the third step is performed; when the utilization of the j-th GPU is 100%, part of the tasks on the j-th GPU are migrated to a GPU whose utilization is R3, and the fourth step is performed.
The range of R1 is: 0% ≤ R1 < 20%.
The range of R2 is: 25% ≤ R2 < 90%.
The range of R3 is: 25% ≤ R3 < 90%.
The migration is performed as follows: the contents of the registers and the caches at all levels of each SP on GPU A are flushed to the SP's on-chip memory; the on-chip memory contents of the SPs on GPU A are transferred to GPU B through the SLI connector; GPU B directly accesses the global memory of GPU A, and the video memories of the GPUs in the system are distributed in a ring; in this way the tasks on GPU A are migrated to GPU B.
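Purely as an illustration, the migration sequence just described can be sketched as host-side C++ pseudocode. The helper functions flush_sp_state, sli_transfer and enable_peer_access are hypothetical placeholders for the hardware steps (register/cache flush to on-chip memory, transfer over the SLI connector, direct peer access to global video memory); they are not real CUDA or driver API calls.

// Minimal host-side sketch of the migration sequence; the helpers are stubs.
#include <cstdio>
#include <vector>

struct Gpu {
    int id;
    std::vector<int> tasks;   // identifiers of tasks currently resident
    bool powered_on = true;
};

// Flush each SP's registers and per-level caches into its on-chip memory.
void flush_sp_state(Gpu& src) { std::printf("flush SP state on GPU %d\n", src.id); }

// Copy the SPs' on-chip memory contents to the destination GPU over the SLI link.
void sli_transfer(Gpu& src, Gpu& dst) {
    std::printf("SLI transfer GPU %d -> GPU %d\n", src.id, dst.id);
}

// Allow the destination GPU to access the source GPU's global video memory
// directly; the per-GPU video memories are assumed to form a ring.
void enable_peer_access(Gpu& src, Gpu& dst) {
    std::printf("enable peer access GPU %d -> GPU %d\n", dst.id, src.id);
}

// Migrate every task from GPU A to GPU B, after which A may be powered off.
void migrate_all_tasks(Gpu& a, Gpu& b) {
    flush_sp_state(a);
    sli_transfer(a, b);
    enable_peer_access(a, b);
    b.tasks.insert(b.tasks.end(), a.tasks.begin(), a.tasks.end());
    a.tasks.clear();
    a.powered_on = false;     // the now-idle GPU is shut down
}

int main() {
    Gpu a{0, {1, 2, 3}}, b{1, {4}};
    migrate_all_tasks(a, b);
    std::printf("GPU %d now holds %zu tasks\n", b.id, b.tasks.size());
}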
Third step: after the i-th GPU has forwarded its new tasks to the GPU whose utilization is R2, the i-th GPU no longer accepts new tasks and its utilization becomes 0; the i-th GPU is then shut down automatically, and the fourth step is performed.
Fourth step: when the utilization of all the running GPUs exceeds the threshold R4 and the system has a shut-down GPU, the system automatically starts one shut-down GPU and assigns new computing tasks to the just-started GPU.
The value of the threshold R4 is: 80% ≤ R4 ≤ 90%.
Fifth step: the above steps are repeated continuously until all the GPUs are running the program.
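A minimal sketch of the control loop formed by the five steps above is given below, again as illustrative C++ only. The utilization values are assumed to be the averages μ = N1/N2 reported by the monitors, the thresholds are example values taken from the ranges stated above (R1 = 20%, R2 and R3 targets in [25%, 90%), R4 = 90%), and migrate_all/migrate_some stand in for the migration mechanism already described.

// One pass of the monitoring/migration policy; hypothetical helpers only.
#include <cstdio>
#include <vector>

struct Gpu { int id; double util = 0.0; bool on = true; };

constexpr double R1 = 0.20;                   // below this, vacate and power off
constexpr double R2_LO = 0.25, R2_HI = 0.90;  // target range for a full migration
constexpr double R3_LO = 0.25, R3_HI = 0.90;  // target range for a partial migration
constexpr double R4 = 0.90;                   // all running GPUs above this: power one on

// Placeholders for the migration mechanism described in the text.
void migrate_all(Gpu& from, Gpu& to)  { std::printf("all tasks:  GPU %d -> GPU %d\n", from.id, to.id); }
void migrate_some(Gpu& from, Gpu& to) { std::printf("some tasks: GPU %d -> GPU %d\n", from.id, to.id); }

Gpu* find_target(std::vector<Gpu>& gpus, const Gpu& src, double lo, double hi) {
    for (auto& g : gpus)
        if (g.on && g.id != src.id && g.util >= lo && g.util < hi) return &g;
    return nullptr;
}

// One pass of steps 2 to 4; the fifth step simply calls this repeatedly.
void control_step(std::vector<Gpu>& gpus) {
    Gpu* off_gpu = nullptr;
    bool all_above_r4 = true;
    for (auto& g : gpus) {
        if (!g.on) { off_gpu = &g; continue; }
        if (g.util < R1) {                                       // step 2, first case
            if (Gpu* t = find_target(gpus, g, R2_LO, R2_HI)) {
                migrate_all(g, *t);
                g.on = false;                                    // step 3: shut the emptied GPU down
            }
        } else if (g.util >= 1.0) {                              // step 2, second case (saturated)
            if (Gpu* t = find_target(gpus, g, R3_LO, R3_HI)) migrate_some(g, *t);
        }
        if (g.on && g.util <= R4) all_above_r4 = false;
    }
    if (all_above_r4 && off_gpu) off_gpu->on = true;             // step 4: restart a GPU
}

int main() {
    std::vector<Gpu> gpus{{0, 0.10}, {1, 0.60}, {2, 1.00}, {3, 0.40}};
    control_step(gpus);   // migrates GPU 0 away and powers it off, offloads part of GPU 2
}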
Compared with the prior art, the present invention has the following beneficial effects:
1. Real-time resource-utilization monitoring. The present invention requires no software intervention: the GPU utilization is monitored directly by hardware, which is highly responsive, so GPUs can be started and shut down in a timely and effective manner.
2. The power consumption of the GPUs is effectively reduced. The degree of parallelism of many current applications falls far short of the tens of thousands of concurrent threads that a multi-GPU system can support, so in many applications energy can be saved by reducing the idle time of GPUs and shutting them down. How much power is saved depends on the application's degree of parallelism; in a 4-GPU system, the power consumption of the multi-GPU system can be reduced by up to 75%.
3. The communication between the GPUs is optimized. In current commercial multi-GPU systems, inter-GPU communication is mainly realized through PCI-Express or the SLI connector. The former only satisfies the inter-GPU communication requirements of common applications; the latter, introduced by NVIDIA for the GeForce 6600GT and later upgraded versions, brings the theoretical peak inter-GPU communication speed to 1 GB/s.
Description of drawings
Fig. 1 is a structural diagram of the system for monitoring GPU utilization in the embodiment;
Fig. 2 is a schematic diagram of task migration in the embodiment.
Embodiment
The method of the present invention is further described below in conjunction with the accompanying drawings. This embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation and a concrete operating process are given, but the protection scope of the present invention is not limited to the following embodiment.
Embodiment
The system in this embodiment is NVIDIA's Tesla S2050 1U computing system based on the "Fermi" architecture. The system contains 4 GPUs, each of which achieves up to 515 Gflops of double-precision peak performance, so that 2 Tflops of double-precision performance is realized in a 1U space at a TDP of 900 W. The power-reduction procedure comprises the following steps:
First step: a GPU utilization monitor is installed on each GPU to count the number N1 of double-precision floating-point operations executed by all the SPs on that GPU during time T, and the average utilization μ = N1/N2 of each GPU over time T is obtained.
In this embodiment N1 is the number of double-precision floating-point operations executed in time T, N2 is the theoretical peak calculation amount of each GPU in time T, namely 515 million double-precision floating-point operations, and T is 1 microsecond.
In this embodiment each GPU has m SPs and each SP has n processing units. During program execution, SIMD (single instruction, multiple data) operations are carried out with the SP as the unit, so a dedicated register in each SP records the number of operations Si executed by that SP per microsecond. Specifically, whenever the control unit issues data to an ALU (arithmetic logic unit), the amount of computation that the instruction will issue to the ALU is stored in a dedicated register; based on the clock frequency of the card, the operation count Si of each SP within one microsecond can be accumulated. The calculation amounts of the m SPs are then added together to give the calculation amount of the GPU, which is compared with the GPU's theoretical peak calculation amount for one microsecond to yield the utilization Ri of the GPU within that microsecond, as shown in Fig. 1.
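Purely to illustrate the arithmetic in this step, the following C++ sketch sums hypothetical per-SP operation counts Si for one sampling window and divides by the window's theoretical peak to obtain the utilization Ri. Reading the actual hardware counters is outside its scope, and the example values are made up.

// Utilization of one GPU for one sampling window: Ri = (sum of Si) / peak.
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

// si: operations executed by each SP during the sampling window T.
// peak_ops: theoretical peak operation count of the whole GPU during T.
double gpu_utilization(const std::vector<std::uint64_t>& si, std::uint64_t peak_ops) {
    const std::uint64_t n1 = std::accumulate(si.begin(), si.end(), std::uint64_t{0});
    return static_cast<double>(n1) / static_cast<double>(peak_ops);   // Ri = N1 / N2
}

int main() {
    // Illustrative only: 14 SPs, each reporting its per-window operation count.
    std::vector<std::uint64_t> si(14, 20000);
    std::uint64_t peak = 515000;   // illustrative peak count for the window
    std::printf("utilization = %.2f%%\n", 100.0 * gpu_utilization(si, peak));
}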
Second step: when the utilization of the i-th GPU is R1, all the tasks on the i-th GPU are migrated to a GPU whose utilization is R2, and the third step is performed; when the utilization of the j-th GPU is 100%, part of the tasks on the j-th GPU are migrated to a GPU whose utilization is R3, and the fourth step is performed.
This embodiment uses a shared global-video-memory technique. When no tasks are migrated between GPUs, each GPU uses its own designated video memory. When tasks are migrated between GPUs, for example when a task on card A needs to be migrated to card B, the contents of the registers and the caches at all levels in each SP on card A are first flushed to the SP's on-chip memory; the on-chip memory contents of the SPs on card A are then transferred to card B through the SLI connector, and card B can directly access the global video memory of card A. This effectively reduces the amount of data transferred; at the same time, to reduce the access latency for the less frequently accessed data in other cards' video memories, the video memories of the GPUs are distributed in a ring, as shown in Fig. 2.
In this embodiment R1 is less than 20%, R2 is between 25% and 75%, and R3 is between 25% and 75%.
Third step: after the i-th GPU has forwarded its new tasks to the GPU whose utilization is R2, the i-th GPU no longer accepts new tasks and its utilization becomes 0; the i-th GPU is then shut down automatically, and the fourth step is performed.
Fourth step: when the utilization of all the running GPUs exceeds the threshold R4 and the system has a shut-down GPU, one shut-down GPU is started automatically, and new tasks are assigned to the just-started card until the utilization of that card reaches R5 or above.
In this embodiment R4 is 90% and R5 is 30%.
Fifth step: the above steps are repeated continuously until all the GPUs are running the program.
This embodiment uses this 4-GPU system to compute protein molecular fields based on quantum chemistry, at a spatial resolution of 256 × 256 × 256, and computes visualized molecular-field results for three classes of protein molecules: 1A30, 1GCV and 1DPS. 1A30 is an HIV-1 protease containing 201 amino acid residues; its conformation is a C2-symmetric homodimer formed by one small inhibitor and two polypeptide chains of 99 amino acids each, each monomer containing two motifs composed entirely of antiparallel β-sheets. 1GCV is a hemoglobin containing 552 amino acid residues. 1DPS is a DPS protein containing 1855 amino acid residues; each of its monomers folds roughly like ferritin, with a pore on its three-fold symmetry axis and a central cavity that is an important active region for the binding and release of ferric ions.
Because the three proteins differ in complexity, the amount of computation needed for their molecular fields also differs. The 1A30 protein structure is relatively simple, so its computation is the smallest and its degree of parallelism the lowest; the 1DPS structure is the most complex, so computing its molecular field uses the most parallel threads and can make relatively full use of the GPU computing resources. In this embodiment, the utilization of the 4 GPUs only reaches 90% or more when each GPU has more than 10,000 parallel threads at a given moment; when simulating protein molecular fields the number of parallel threads is in many cases below 10,000, so energy is saved by shutting GPUs down. When computing the 1A30 molecular field, if the power of each GPU is taken as 1 unit and the total running time is 4, the conventional approach of using all the GPUs consumes 1 × 4 × 4 = 16; with the task migration technique proposed in this embodiment, 4 GPUs are used for a time of 0.5, 3 GPUs for 1.5, 2 GPUs for 1.5 and 1 GPU for 1, the total computing time is 4.5, the power consumption is 4 × 0.5 + 3 × 1.5 + 2 × 1.5 + 1 × 1 = 10.5, the computation is about 0.5/4 = 12.5% slower, and the energy saved is 5.5/16 = 34.4%. When computing the 1DPS molecular field, again with each GPU taken as 1 unit of power and a total running time of 15, the conventional approach consumes 1 × 4 × 15 = 60; with the task migration technique, 4 GPUs are used for a time of 5, 3 GPUs for 6, 2 GPUs for 4 and 1 GPU for 2, the total computing time is 17, the power consumption is 4 × 6 + 3 × 6 + 2 × 4 + 1 × 2 = 50, the computation is about 2/17 = 11.8% slower, and the energy saved is 10/60 = 16.7%.
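The energy accounting for the 1A30 example above can be reproduced with a short illustrative C++ sketch (not part of the patented method): energy is simply the sum of the number of active GPUs multiplied by time over the phases of the run, with per-GPU power normalized to one unit.

// Reproduces the 1A30 energy accounting from the embodiment.
#include <cstdio>
#include <utility>
#include <vector>

// Each phase is (number of active GPUs, duration); energy = sum of gpus * time.
double energy(const std::vector<std::pair<int, double>>& phases) {
    double e = 0.0;
    for (const auto& [gpus, time] : phases) e += gpus * time;
    return e;
}

int main() {
    const double baseline = energy({{4, 4.0}});                               // 16 units
    const double migrated = energy({{4, 0.5}, {3, 1.5}, {2, 1.5}, {1, 1.0}}); // 10.5 units
    std::printf("baseline=%.1f migrated=%.1f saved=%.1f%% slowdown=%.1f%%\n",
                baseline, migrated,
                100.0 * (baseline - migrated) / baseline,                     // 34.4%
                100.0 * (4.5 - 4.0) / 4.0);                                   // 12.5%
}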
It can be seen that the dynamic task migration technique for multi-GPU systems proposed in this embodiment achieves the goal of saving power while ensuring that performance is not significantly affected.

Claims (6)

  1. A method for reducing power consumption based on dynamic task migration technology in a multi-GPU system, characterized in that it comprises the following steps:
    First step: a GPU utilization monitor is installed on each GPU to count the number of operations N1 executed by all the SPs on that GPU during time T, and the average utilization μ = N1/N2 of each GPU over time T is obtained, where N2 is the theoretical peak calculation amount of that GPU during time T;
    Second step: when the utilization of the i-th GPU is R1, all the tasks on the i-th GPU are migrated to a GPU whose utilization is R2, and the third step is performed; when the utilization of the j-th GPU is 100%, part of the tasks on the j-th GPU are migrated to a GPU whose utilization is R3, and the fourth step is performed;
    Third step: after the i-th GPU has forwarded its new tasks to the GPU whose utilization is R2, the i-th GPU no longer accepts new tasks and its utilization becomes 0; the i-th GPU is then shut down automatically, and the fourth step is performed;
    Fourth step: when the utilization of all the running GPUs exceeds the threshold R4 and the system has a shut-down GPU, the system automatically starts one shut-down GPU and assigns new computing tasks to the just-started GPU;
    Fifth step: the above steps are repeated continuously until all the GPUs are running the program.
  2. The method for reducing power consumption based on dynamic task migration technology in a multi-GPU system according to claim 1, characterized in that the range of R1 is: 0% ≤ R1 < 20%.
  3. The method for reducing power consumption based on dynamic task migration technology in a multi-GPU system according to claim 1, characterized in that the range of R2 is: 25% ≤ R2 < 90%.
  4. The method for reducing power consumption based on dynamic task migration technology in a multi-GPU system according to claim 1, characterized in that the range of R3 is: 25% ≤ R3 < 90%.
  5. The method for reducing power consumption based on dynamic task migration technology in a multi-GPU system according to claim 1, characterized in that the migration is performed as follows: the contents of the registers and the caches at all levels of the SPs on GPU A are flushed to the on-chip memory of the SPs; the on-chip memory contents of the SPs on GPU A are transferred to GPU B through the SLI connector; GPU B directly accesses the global memory of GPU A, and the video memories of the GPUs in the system are distributed in a ring, whereby the tasks on GPU A are migrated to GPU B.
  6. The method for reducing power consumption based on dynamic task migration technology in a multi-GPU system according to claim 1, characterized in that the value of the threshold R4 is: 80% ≤ R4 ≤ 90%.
CN2010102641204A 2010-08-27 2010-08-27 Method for reducing power consumption based on dynamic task migrating technology in multi-GPU (Graphic Processing Unit) system Expired - Fee Related CN101901042B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102641204A CN101901042B (en) 2010-08-27 2010-08-27 Method for reducing power consumption based on dynamic task migrating technology in multi-GPU (Graphic Processing Unit) system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102641204A CN101901042B (en) 2010-08-27 2010-08-27 Method for reducing power consumption based on dynamic task migrating technology in multi-GPU (Graphic Processing Unit) system

Publications (2)

Publication Number Publication Date
CN101901042A true CN101901042A (en) 2010-12-01
CN101901042B CN101901042B (en) 2011-07-27

Family

ID=43226638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102641204A Expired - Fee Related CN101901042B (en) 2010-08-27 2010-08-27 Method for reducing power consumption based on dynamic task migrating technology in multi-GPU (Graphic Processing Unit) system

Country Status (1)

Country Link
CN (1) CN101901042B (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385668A (en) * 2011-09-19 2012-03-21 浙江大学 Method for predicting interaction loci based on protein molecular field
CN102857533A (en) * 2011-07-01 2013-01-02 云联(北京)信息技术有限公司 Remote interaction type system on basis of cloud computing
CN102880785A (en) * 2012-08-01 2013-01-16 北京大学 Method for estimating transmission energy consumption of source code grade data directed towards GPU program
CN103019367A (en) * 2012-12-03 2013-04-03 福州瑞芯微电子有限公司 Embedded type GPU (Graphic Processing Unit) dynamic frequency modulating method and device based on Android system
CN103037109A (en) * 2012-12-12 2013-04-10 中国联合网络通信集团有限公司 Multicore equipment energy consumption management method and device
CN103077082A (en) * 2013-01-08 2013-05-01 中国科学院深圳先进技术研究院 Method and system for distributing data center load and saving energy during virtual machine migration
CN103105895A (en) * 2011-11-15 2013-05-15 辉达公司 Computer system and display cards thereof and method for processing graphs of computer system
CN103150212A (en) * 2011-12-06 2013-06-12 曙光信息产业股份有限公司 Method and device for realizing quantum mechanics calculation
CN103428228A (en) * 2012-05-14 2013-12-04 辉达公司 Graphic display card for conducting cooperative calculation through wireless technology
CN103577269A (en) * 2012-08-02 2014-02-12 英特尔公司 Media workload scheduler
WO2014105303A1 (en) * 2012-12-27 2014-07-03 Intel Corporation Methods, systems and apparatus to manage power consumption of a graphics engine
CN104407920A (en) * 2014-12-23 2015-03-11 浪潮(北京)电子信息产业有限公司 Data processing method and system based on inter-process communication
CN105046638A (en) * 2015-08-06 2015-11-11 骆凌 Processor system and image data processing method thereof
CN106211511A (en) * 2016-07-25 2016-12-07 青岛海信电器股份有限公司 The method of adjustment of horse race lamp rolling speed and display device
CN107122245A (en) * 2017-04-25 2017-09-01 上海交通大学 GPU task dispatching method and system
CN108022269A (en) * 2017-11-24 2018-05-11 中国航空工业集团公司西安航空计算技术研究所 A kind of modeling structure of GPU compressed textures storage Cache
CN108170525A (en) * 2016-12-07 2018-06-15 晨星半导体股份有限公司 The device and method of the task load configuration of dynamic adjustment multi-core processor
CN108694151A (en) * 2017-04-09 2018-10-23 英特尔公司 Computing cluster in universal graphics processing unit is seized
CN109753134A (en) * 2018-12-24 2019-05-14 四川大学 A kind of GPU inside energy consumption control system and method based on overall situation decoupling
CN109992385A (en) * 2019-03-19 2019-07-09 四川大学 A kind of inside GPU energy consumption optimization method of task based access control balance dispatching
CN110399252A (en) * 2019-07-19 2019-11-01 广东浪潮大数据研究有限公司 A kind of data back up method, device, equipment and computer readable storage medium
CN110457135A (en) * 2019-08-09 2019-11-15 重庆紫光华山智安科技有限公司 A kind of method of resource regulating method, device and shared GPU video memory
CN111651131A (en) * 2020-05-18 2020-09-11 武汉联影医疗科技有限公司 Image display method and device and computer equipment
CN111930593A (en) * 2020-07-27 2020-11-13 长沙景嘉微电子股份有限公司 GPU occupancy rate determination method, device, processing system and storage medium
CN112000468A (en) * 2020-08-03 2020-11-27 苏州浪潮智能科技有限公司 GPU management device and method based on detection and adjustment module and GPU server
CN112181124A (en) * 2020-09-11 2021-01-05 华为技术有限公司 Method for power consumption management and related device
CN113157407A (en) * 2021-03-18 2021-07-23 浙大宁波理工学院 Dynamic task migration scheduling method for parallel processing of video compression in GPU
US11262831B2 (en) 2018-08-17 2022-03-01 Hewlett-Packard Development Company, L.P. Modifications of power allocations for graphical processing units based on usage
CN116954929A (en) * 2023-09-20 2023-10-27 四川并济科技有限公司 Dynamic GPU scheduling method and system for live migration

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7256788B1 (en) * 2002-06-11 2007-08-14 Nvidia Corporation Graphics power savings system and method
CN101231552A (en) * 2007-01-24 2008-07-30 惠普开发有限公司 Regulating power consumption
US20090135180A1 (en) * 2007-11-28 2009-05-28 Siemens Corporate Research, Inc. APPARATUS AND METHOD FOR VOLUME RENDERING ON MULTIPLE GRAPHICS PROCESSING UNITS (GPUs)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7256788B1 (en) * 2002-06-11 2007-08-14 Nvidia Corporation Graphics power savings system and method
CN101231552A (en) * 2007-01-24 2008-07-30 惠普开发有限公司 Regulating power consumption
US20090135180A1 (en) * 2007-11-28 2009-05-28 Siemens Corporate Research, Inc. APPARATUS AND METHOD FOR VOLUME RENDERING ON MULTIPLE GRAPHICS PROCESSING UNITS (GPUs)

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102857533A (en) * 2011-07-01 2013-01-02 云联(北京)信息技术有限公司 Remote interaction type system on basis of cloud computing
CN102857533B (en) * 2011-07-01 2015-11-18 云联(北京)信息技术有限公司 A kind of long-distance interactive system based on cloud computing
CN102385668A (en) * 2011-09-19 2012-03-21 浙江大学 Method for predicting interaction loci based on protein molecular field
CN103105895A (en) * 2011-11-15 2013-05-15 辉达公司 Computer system and display cards thereof and method for processing graphs of computer system
CN103150212B (en) * 2011-12-06 2016-04-06 曙光信息产业股份有限公司 The implementation method of Quantum mechanical calculation and device
CN103150212A (en) * 2011-12-06 2013-06-12 曙光信息产业股份有限公司 Method and device for realizing quantum mechanics calculation
US9256914B2 (en) 2012-05-14 2016-02-09 Nvidia Corporation Graphic card for collaborative computing through wireless technologies
CN103428228A (en) * 2012-05-14 2013-12-04 辉达公司 Graphic display card for conducting cooperative calculation through wireless technology
CN102880785A (en) * 2012-08-01 2013-01-16 北京大学 Method for estimating transmission energy consumption of source code grade data directed towards GPU program
CN103577269A (en) * 2012-08-02 2014-02-12 英特尔公司 Media workload scheduler
CN103019367B (en) * 2012-12-03 2015-07-08 福州瑞芯微电子有限公司 Embedded type GPU (Graphic Processing Unit) dynamic frequency modulating method and device based on Android system
CN103019367A (en) * 2012-12-03 2013-04-03 福州瑞芯微电子有限公司 Embedded type GPU (Graphic Processing Unit) dynamic frequency modulating method and device based on Android system
CN103037109B (en) * 2012-12-12 2015-02-25 中国联合网络通信集团有限公司 Multicore equipment energy consumption management method and device
CN103037109A (en) * 2012-12-12 2013-04-10 中国联合网络通信集团有限公司 Multicore equipment energy consumption management method and device
WO2014105303A1 (en) * 2012-12-27 2014-07-03 Intel Corporation Methods, systems and apparatus to manage power consumption of a graphics engine
US9098282B2 (en) 2012-12-27 2015-08-04 Intel Corporation Methods, systems and apparatus to manage power consumption of a graphics engine
US9460483B2 (en) 2012-12-27 2016-10-04 Intel Corporation Methods, systems and apparatus to manage power consumption of a graphics engine
CN103077082A (en) * 2013-01-08 2013-05-01 中国科学院深圳先进技术研究院 Method and system for distributing data center load and saving energy during virtual machine migration
CN103077082B (en) * 2013-01-08 2016-12-28 中国科学院深圳先进技术研究院 A kind of data center loads distribution and virtual machine (vm) migration power-economizing method and system
CN104407920A (en) * 2014-12-23 2015-03-11 浪潮(北京)电子信息产业有限公司 Data processing method and system based on inter-process communication
CN104407920B (en) * 2014-12-23 2018-02-09 浪潮(北京)电子信息产业有限公司 A kind of data processing method and system based on interprocess communication
CN105046638A (en) * 2015-08-06 2015-11-11 骆凌 Processor system and image data processing method thereof
CN105046638B (en) * 2015-08-06 2019-05-21 骆凌 Processor system and its image processing method
CN106211511A (en) * 2016-07-25 2016-12-07 青岛海信电器股份有限公司 The method of adjustment of horse race lamp rolling speed and display device
CN108170525A (en) * 2016-12-07 2018-06-15 晨星半导体股份有限公司 The device and method of the task load configuration of dynamic adjustment multi-core processor
CN108694151A (en) * 2017-04-09 2018-10-23 英特尔公司 Computing cluster in universal graphics processing unit is seized
CN107122245A (en) * 2017-04-25 2017-09-01 上海交通大学 GPU task dispatching method and system
CN107122245B (en) * 2017-04-25 2019-06-04 上海交通大学 GPU task dispatching method and system
CN108022269A (en) * 2017-11-24 2018-05-11 中国航空工业集团公司西安航空计算技术研究所 A kind of modeling structure of GPU compressed textures storage Cache
US11262831B2 (en) 2018-08-17 2022-03-01 Hewlett-Packard Development Company, L.P. Modifications of power allocations for graphical processing units based on usage
CN109753134A (en) * 2018-12-24 2019-05-14 四川大学 A kind of GPU inside energy consumption control system and method based on overall situation decoupling
CN109753134B (en) * 2018-12-24 2022-04-15 四川大学 Global decoupling-based GPU internal energy consumption control system and method
CN109992385B (en) * 2019-03-19 2021-05-14 四川大学 GPU internal energy consumption optimization method based on task balance scheduling
CN109992385A (en) * 2019-03-19 2019-07-09 四川大学 A kind of inside GPU energy consumption optimization method of task based access control balance dispatching
CN110399252A (en) * 2019-07-19 2019-11-01 广东浪潮大数据研究有限公司 A kind of data back up method, device, equipment and computer readable storage medium
CN110457135A (en) * 2019-08-09 2019-11-15 重庆紫光华山智安科技有限公司 A kind of method of resource regulating method, device and shared GPU video memory
CN111651131A (en) * 2020-05-18 2020-09-11 武汉联影医疗科技有限公司 Image display method and device and computer equipment
CN111651131B (en) * 2020-05-18 2024-02-27 武汉联影医疗科技有限公司 Image display method and device and computer equipment
CN111930593A (en) * 2020-07-27 2020-11-13 长沙景嘉微电子股份有限公司 GPU occupancy rate determination method, device, processing system and storage medium
CN112000468A (en) * 2020-08-03 2020-11-27 苏州浪潮智能科技有限公司 GPU management device and method based on detection and adjustment module and GPU server
CN112000468B (en) * 2020-08-03 2023-02-24 苏州浪潮智能科技有限公司 GPU management device and method based on detection and adjustment module and GPU server
CN112181124A (en) * 2020-09-11 2021-01-05 华为技术有限公司 Method for power consumption management and related device
CN112181124B (en) * 2020-09-11 2023-09-01 华为技术有限公司 Method for managing power consumption and related equipment
CN113157407A (en) * 2021-03-18 2021-07-23 浙大宁波理工学院 Dynamic task migration scheduling method for parallel processing of video compression in GPU
CN113157407B (en) * 2021-03-18 2024-03-01 浙大宁波理工学院 Dynamic task migration scheduling method for parallel processing video compression in GPU
CN116954929A (en) * 2023-09-20 2023-10-27 四川并济科技有限公司 Dynamic GPU scheduling method and system for live migration
CN116954929B (en) * 2023-09-20 2023-12-01 四川并济科技有限公司 Dynamic GPU scheduling method and system for live migration

Also Published As

Publication number Publication date
CN101901042B (en) 2011-07-27

Similar Documents

Publication Publication Date Title
CN101901042B (en) Method for reducing power consumption based on dynamic task migrating technology in multi-GPU (Graphic Processing Unit) system
Brodtkorb et al. Graphics processing unit (GPU) programming strategies and trends in GPU computing
Hong-Tao et al. K-means on commodity GPUs with CUDA
CN113383310A (en) Pulse decomposition within matrix accelerator architecture
CN113424162A (en) Dynamic memory reconfiguration
Prakash et al. Energy-efficient execution of data-parallel applications on heterogeneous mobile platforms
Rajovic et al. Experiences with mobile processors for energy efficient HPC
Wong et al. Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor
US11042370B2 (en) Instruction and logic for systolic dot product with accumulate
US10560892B2 (en) Advanced graphics power state management
Wynters Parallel processing on NVIDIA graphics processing units using CUDA
Pospichal et al. Parallel genetic algorithm solving 0/1 knapsack problem running on the gpu
US20210103433A1 (en) Kernel fusion for machine learning
US20210267095A1 (en) Intelligent and integrated liquid-cooled rack for datacenters
US20200372337A1 (en) Parallelization strategies for training a neural network
Farber Topical perspective on massive threading and parallelism
CN112233010A (en) Partial write management in a multi-block graphics engine
Fang et al. Parallel Computation of Non-Bonded Interactions in Drug Discovery: Nvidia GPUs vs. Intel Xeon Phi.
Haidar et al. Optimization for performance and energy for batched matrix computations on GPUs
CN103049329A (en) High-efficiency system based on central processing unit (CPU)/many integrated core (MIC) heterogeneous system structure
Gupta et al. Performance Analysis of GPU compared to Single-core and Multi-core CPU for Natural Language Applications
Wang Power analysis and optimizations for GPU architecture using a power simulator
US11822926B2 (en) Device link management
Singh et al. Accelerating smith-waterman on heterogeneous cpu-gpu systems
US20220309017A1 (en) Multi-format graphics processing unit docking board

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110727

Termination date: 20140827

EXPY Termination of patent right or utility model