CN115373507A

CN115373507A - Whole machine resource balance management method and system based on electric energy loss

Info

Publication number: CN115373507A
Application number: CN202211316675.8A
Authority: CN
Inventors: 耿春胜
Original assignee: Beijing Pinli Technology Co ltd
Current assignee: Beijing Pinli Technology Co ltd
Priority date: 2022-10-26
Filing date: 2022-10-26
Publication date: 2022-11-22
Anticipated expiration: 2042-10-26
Also published as: CN115373507B

Abstract

A complete machine resource balance management method and system based on electric energy loss, relating to the technical field of computer electric energy management. The method comprises the following steps: acquiring a GPU resource consumption curve of each task and determining a GPU resource consumption upper limit value of each task; determining a prepared resource value of each task as a preset proportion of that task's GPU resource consumption upper limit value, and determining a GPU resource theoretical total occupation value of each task in combination with the GPU resource consumption upper limit value; determining a total needed GPU resource value from the GPU resource theoretical total occupation values of all tasks; determining the GPU hardware needing to be started according to a preset GPU hardware resource table; migrating all tasks to the GPU hardware needing to be started; and finally shutting down the GPU hardware on which no task is running, so that the power consumption of the GPU cluster host is reduced to the maximum extent.

Description

Whole machine resource balance management method and system based on electric energy loss

Technical Field

The invention relates to the technical field of computer electric energy management, in particular to a complete machine resource balance management method and system based on electric energy loss.

Background

Because the computing power that a single computer can provide is limited, a cluster is usually used when a computing task with a large computing requirement needs to be processed; a cluster is a supercomputer formed by interconnecting a plurality of computers through a high-speed network. The GPU is also a kind of computing resource, and computing operations related to artificial intelligence and machine learning are currently generally performed on GPUs. Each node in a cluster configured with GPU resources usually has multiple GPU graphics cards installed, for example 8 or more, so the total number of GPU graphics cards in the cluster is very large.

A GPU cluster is a computer cluster in which each node is equipped with a graphics processing unit, and by utilizing the computational power of modern GPUs through general-purpose computation on the graphics processing unit, very fast computations can be performed using the GPU cluster.

When such a computer cluster is in use, most GPUs keep running at a reduced frequency even when the task load is very low, which wastes part of the performance resources.

Disclosure of Invention

The invention aims to provide a method and a system capable of managing the electric energy consumed by a whole GPU cluster machine.

The invention discloses a complete machine resource balance management method based on electric energy loss, which comprises the following steps:

acquiring a GPU resource consumption curve of each task, determining a GPU resource consumption upper limit value of each task, determining a prepared resource value of each task based on a preset proportion of the GPU resource consumption upper limit value of each task, and determining a GPU resource theoretical total occupation value of each task in combination with the GPU resource consumption upper limit value;

and determining a total GPU resource value according to the GPU resource theoretical total occupation value of each task, determining GPU hardware needing to be started according to a preset GPU hardware resource table, migrating all tasks to the GPU hardware needing to be started, and finally closing the GPU hardware without task operation.
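The per-task values described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names and `reserve_ratio` are hypothetical, the curve is assumed to be a list of sampled resource values, and "in combination with" is read here as a simple sum.

```python
from typing import Dict, List


def theoretical_occupation(curve: List[float], reserve_ratio: float) -> float:
    """Theoretical total occupation value of one task (assumed to be a sum)."""
    upper_limit = max(curve)                # GPU resource consumption upper limit value
    prepared = reserve_ratio * upper_limit  # prepared resource value (preset proportion)
    return upper_limit + prepared


def total_required(curves: Dict[str, List[float]], reserve_ratio: float) -> float:
    """Total needed GPU resource value: sum over all tasks."""
    return sum(theoretical_occupation(c, reserve_ratio) for c in curves.values())


# Hypothetical sample curves for two tasks.
curves = {"task_a": [10.0, 40.0, 25.0], "task_b": [5.0, 20.0, 15.0]}
total = total_required(curves, reserve_ratio=0.25)
print(total)
```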

In some embodiments of the present application, in order to determine the GPU hardware that needs to be started, contents of the GPU hardware resource table are disclosed, and the contents of the GPU hardware resource table include:

a plurality of GPU hardware, with position information, a performance resource value and a power consumption value set correspondingly for each GPU hardware.

In some embodiments of the present application, in order to avoid overload of GPU hardware due to a newly injected task, the overall resource balancing management method is improved, and the overall resource balancing management method further includes:

and determining and starting prepared GPU hardware according to the total needed GPU resource value and the GPU hardware resource table, wherein the prepared GPU hardware is used for preparing to run the newly added task.

In some embodiments of the present application, in order to avoid overload of the GPU hardware caused by a newly injected task, the method for determining the preparation GPU hardware includes:

determining a total required prepared resource value according to a preset proportion of the total required GPU resource value, and determining prepared GPU hardware needing to be called according to the GPU hardware resource table, wherein the performance resource value of the prepared GPU hardware is larger than or equal to the total required prepared resource value.
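One way to read this selection rule is sketched below. The text only requires that the performance resource value of the prepared GPU hardware be at least the total required prepared resource value; choosing a single idle unit with the lowest power among the qualifying ones is an assumption made for this example, as are the names `GpuEntry`, `pick_prepared_gpu` and `prepared_ratio`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GpuEntry:
    position: str       # position information
    performance: float  # performance resource value
    power: float        # power consumption value


def pick_prepared_gpu(idle: List[GpuEntry], total_required: float,
                      prepared_ratio: float) -> Optional[GpuEntry]:
    """Return an idle GPU whose performance covers the prepared resource value."""
    total_prepared = prepared_ratio * total_required
    candidates = [g for g in idle if g.performance >= total_prepared]
    # Assumption: prefer the qualifying unit that draws the least power.
    return min(candidates, key=lambda g: g.power, default=None)


# Hypothetical idle units and demand.
idle = [GpuEntry("node1/gpu2", 30.0, 120.0),
        GpuEntry("node1/gpu3", 20.0, 70.0),
        GpuEntry("node2/gpu0", 10.0, 40.0)]
chosen = pick_prepared_gpu(idle, total_required=80.0, prepared_ratio=0.25)
```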

In some embodiments of the present application, a method for presetting the GPU hardware resource table is disclosed, and the method for presetting the GPU hardware resource table includes:

after the position information of the GPU hardware is obtained, sending a virtual task to the GPU hardware so as to enable the GPU hardware to run at full load;

evaluating the execution effect of the virtual task based on the GPU hardware to determine a performance resource value of the GPU hardware;

and when the GPU hardware runs at full load, acquiring the power consumption of the GPU hardware so as to determine the power consumption value of the GPU hardware.
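The three presetting steps above can be sketched as a loop over GPU positions. The benchmark itself is not specified in the text, so the sketch takes it as a caller-supplied function; `run_virtual_task` is a hypothetical callback assumed to drive one GPU to full load and return both a score and the power measured during that run. The stand-in below returns fixed numbers so that the example runs without GPU hardware.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class GpuEntry:
    position: str       # position information
    performance: float  # performance resource value from the virtual task
    power: float        # power consumption value measured at full load


def build_resource_table(
    positions: Iterable[str],
    run_virtual_task: Callable[[str], Tuple[float, float]],
) -> Dict[str, GpuEntry]:
    """Benchmark every GPU once and record the result under its position."""
    table: Dict[str, GpuEntry] = {}
    for position in positions:
        score, full_load_power = run_virtual_task(position)
        table[position] = GpuEntry(position, score, full_load_power)
    return table


# Stand-in for a real benchmark: fixed, made-up (score, power) pairs.
_FAKE_RESULTS = {"node0/gpu0": (100.0, 250.0), "node0/gpu1": (60.0, 120.0)}


def fake_virtual_task(position: str) -> Tuple[float, float]:
    return _FAKE_RESULTS[position]


table = build_resource_table(_FAKE_RESULTS, fake_virtual_task)
```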

In some embodiments of the present application, in order to determine the GPU hardware that needs to be booted, a method for applying the GPU hardware resource table is disclosed, and the method for applying the GPU hardware resource table includes:

and determining a plurality of GPU hardware as the GPU hardware needing to be started according to the total needed GPU resource value and the GPU hardware resource table, so that the total performance resource value of the GPU hardware needing to be started is larger than the total needed GPU resource value, and the total energy consumption value of the GPU hardware is the lowest.

In some embodiments of the present application, a method for applying the GPU hardware resource table is further disclosed, and the method for applying the GPU hardware resource table further includes:

establishing a temporary GPU hardware call table, successively filling position information of GPU hardware in the temporary GPU hardware call table according to the total needed GPU resource value, and successively calculating the total performance resource value of the GPU hardware in the temporary GPU hardware call table;

and if the total performance resource value of the GPU hardware in the temporary GPU hardware call list is larger than or equal to the total needed GPU resource value, determining the GPU hardware corresponding to the position information of the GPU hardware recorded in the temporary GPU hardware call list as the GPU hardware needing to be started.

In some embodiments of the present application, in order to minimize the total energy consumption value of the GPU hardware that needs to be started, the method for applying the GPU hardware resource table further includes:

setting the performance resource value of the GPU hardware as a, setting the energy consumption value as b, setting the performance energy consumption ratio of the GPU hardware as h, and calculating the performance energy consumption ratio h by h = a/b;

and when the temporary GPU hardware call table is established, calling GPU hardware sequentially in descending order of the performance energy consumption ratio h.
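The call-table procedure can be sketched as a greedy loop: rank the hardware by h = a/b, then add positions until the accumulated performance resource value reaches the total needed GPU resource value. The names `GpuEntry` and `select_gpus_to_start` and all numbers are hypothetical, and raising an error when the cluster cannot cover the demand is a choice made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GpuEntry:
    position: str
    performance: float  # a
    power: float        # b


def select_gpus_to_start(table: List[GpuEntry], total_required: float) -> List[str]:
    """Fill a temporary call table in descending order of h = a / b."""
    ranked = sorted(table, key=lambda g: g.performance / g.power, reverse=True)
    call_table: List[str] = []
    total_performance = 0.0
    for gpu in ranked:
        if total_performance >= total_required:
            break
        call_table.append(gpu.position)
        total_performance += gpu.performance
    if total_performance < total_required:
        raise ValueError("total performance of the cluster is below the demand")
    return call_table


# Hypothetical resource table: h is 0.4, 0.5, 0.25 and about 0.33 respectively.
table = [GpuEntry("g0", 100.0, 250.0), GpuEntry("g1", 60.0, 120.0),
         GpuEntry("g2", 40.0, 160.0), GpuEntry("g3", 80.0, 240.0)]
started = select_gpus_to_start(table, total_required=150.0)
```

With these numbers the loop takes g1 first (60), then g0 (160 in total), and stops because 160 is at least 150.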

According to the overall resource balance management method based on the electric energy loss, the actually needed GPU resources are determined by establishing the relation between tasks and GPU resource consumption, GPU hardware needing to be started is determined according to the actually needed GPU resources, all tasks are migrated to the GPU hardware needing to be started, and then the GPU hardware without task execution is closed, so that the power consumption of a GPU cluster host is reduced to the maximum extent.

In some embodiments of the present application, a system for balancing and managing overall resources based on power consumption is disclosed, the system comprising:

the resource occupation condition acquisition unit is used for acquiring the GPU performance resource occupancy rate of the task;

the analysis unit is used for generating a GPU resource consumption curve according to the GPU performance resource occupation value, analyzing and determining the GPU resource consumption upper limit value of the task in a preset time period, determining the prepared resource value of the task based on the preset proportion of the GPU resource consumption upper limit value, determining the theoretical total occupation value of the GPU resource of the task by combining the GPU resource consumption upper limit value, determining the total needed GPU resource value according to the theoretical total occupation value of the GPU resource of each task, determining GPU hardware needing to be started according to a preset GPU hardware resource table, migrating all tasks to the GPU hardware needing to be started, and determining the GPU hardware that runs no task and needs to be shut down;

and the power supply control unit is used for connecting or disconnecting the power supply of the GPU hardware according to the analysis unit's determination of which GPU hardware needs to be started or shut down.
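The decision handed from the analysis unit to the power supply control unit can be sketched as a comparison between the current power state and the set of GPU hardware that needs to run. The function `plan_power_actions` and the boolean power-state map are hypothetical; how power is actually switched is not described in the text and is left out.

```python
from typing import Dict, Iterable, Set, Tuple


def plan_power_actions(powered: Dict[str, bool],
                       needed: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (positions to connect, positions to disconnect)."""
    needed_set = set(needed)
    unknown = needed_set.difference(powered)
    if unknown:
        raise KeyError(f"unknown GPU positions: {sorted(unknown)}")
    to_connect = {p for p in needed_set if not powered[p]}
    to_disconnect = {p for p, on in powered.items() if on and p not in needed_set}
    return to_connect, to_disconnect


# Hypothetical state: g0 and g2 are powered, the analysis unit wants g0 and g1.
powered = {"g0": True, "g1": False, "g2": True, "g3": False}
to_connect, to_disconnect = plan_power_actions(powered, ["g0", "g1"])
```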

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

Fig. 1 is a method step diagram of a method for balancing management of resources of a whole machine based on power consumption in an embodiment of the present application;

Fig. 2 is a flowchart of performing energy consumption control on a GPU cluster in this embodiment.

Detailed Description

The technical solution of the present invention is further illustrated by the accompanying drawings and examples.

Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. The use of "first," "second," and similar terms in the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.

Example:

The invention discloses a complete machine resource balance management method based on electric energy loss, and with reference to fig. 1, the method comprises the following steps:

Step S100, acquiring a GPU resource consumption curve of each task, and determining the GPU resource consumption upper limit value of the task.

Step S200, determining a prepared resource value of the task based on a preset proportion of the GPU resource consumption upper limit value of the task, and determining a GPU resource theoretical total occupation value of the task by combining the GPU resource consumption upper limit value.

The GPU resource consumption upper limit value and the prepared resource value are measured against the same standard. Before either is fixed, the resource value of the GPU hardware must be determined; this resource value is the capacity of the GPU hardware to process tasks, and it can be set from the rendering effect of a graphic or the execution effect of an algorithm. The prepared resource value is reserved for each task so that the GPU hardware is not overloaded while a single task is executing. The GPU resource theoretical total occupation value of the task equals the sum of the GPU resource consumption upper limit value and the prepared resource value.
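The quantities in steps S100 and S200 can be written compactly. The symbols are introduced here for illustration only and do not appear in the original text: c_i(t) is the GPU resource consumption curve of task i over the preset time period [t_0, t_1], ρ is the preset proportion, U_i is the upper limit value, P_i is the prepared resource value, T_i is the theoretical total occupation value, and R is the total needed GPU resource value. Reading "combining" as a sum is an assumption.

```latex
U_i = \max_{t \in [t_0,\, t_1]} c_i(t), \qquad
P_i = \rho \, U_i, \qquad
T_i = U_i + P_i = (1 + \rho)\, U_i, \qquad
R = \sum_{i} T_i
```

As a made-up numeric instance, a task whose curve peaks at U_i = 40 with ρ = 0.25 gets P_i = 10 and T_i = 50.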

Step S300, determining a total GPU resource value according to the theoretical total occupation value of the GPU resources of each task, and determining GPU hardware needing to be started according to a preset GPU hardware resource table.

Step S400, migrating all tasks to the GPU hardware needing to be started, and finally shutting down the GPU hardware on which no task is running.
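Step S400 does not say how tasks are placed on the started GPU hardware. The sketch below uses first-fit decreasing bin packing purely as a stand-in placement rule; that choice, the function `migrate_tasks` and all values are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple


def migrate_tasks(task_demand: Dict[str, float],
                  started_capacity: Dict[str, float],
                  all_gpus: List[str]) -> Tuple[Dict[str, str], Set[str]]:
    """Place tasks on started GPUs and report GPUs left without any task."""
    remaining = dict(started_capacity)
    placement: Dict[str, str] = {}
    # First-fit decreasing: largest demand first, first GPU with enough room.
    for task, demand in sorted(task_demand.items(), key=lambda kv: kv[1], reverse=True):
        target = next((g for g, free in remaining.items() if free >= demand), None)
        if target is None:
            raise RuntimeError(f"no started GPU has room for {task}")
        placement[task] = target
        remaining[target] -= demand
    to_shut_down = set(all_gpus) - set(placement.values())
    return placement, to_shut_down


# Hypothetical theoretical total occupation values and started hardware.
demand = {"t1": 50.0, "t2": 25.0, "t3": 30.0}
placement, to_shut_down = migrate_tasks(
    demand, {"g1": 60.0, "g0": 100.0}, ["g0", "g1", "g2", "g3"])
```

A total performance resource value that covers the total demand does not by itself guarantee that every individual task fits on some GPU, which is why the sketch raises an error instead of assuming a placement always exists.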

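In order to determine the GPU hardware that needs to be started, in some embodiments of the present application, the contents of the GPU hardware resource table are disclosed. The table lists a plurality of GPU hardware, with position information, a performance resource value and a power consumption value set correspondingly for each GPU hardware.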

In some embodiments of the present application, in order to avoid overload of GPU hardware due to a newly injected task, the overall resource balancing management method is improved, and the overall resource balancing management method further includes: and determining and starting prepared GPU hardware according to the total needed GPU resource value and the GPU hardware resource table, wherein the prepared GPU hardware is used for preparing to run the newly added task.

In some embodiments of the present application, a method for presetting the GPU hardware resource table is disclosed, and the method comprises the following steps:

Firstly, after the position information of the GPU hardware is obtained, sending a virtual task to the GPU hardware so as to make the GPU hardware run at full load.

The virtual task can be based on a neural network learning algorithm or on dynamic image rendering.

Secondly, evaluating the execution effect of the virtual task on the GPU hardware so as to determine the performance resource value of the GPU hardware.

For the evaluation, the same virtual task is given to every GPU hardware, and the effect achieved in executing that virtual task is used as the evaluation criterion.

Thirdly, acquiring the power consumption of the GPU hardware while it runs at full load so as to determine the power consumption value of the GPU hardware.

The energy consumption value is directly proportional to the consumed power.

In order to avoid overload of GPU hardware caused by a newly injected task, in some embodiments of the present application, the method for determining the preparation GPU hardware includes: and determining a total required prepared resource value according to a preset proportion of the total required GPU resource value, and determining prepared GPU hardware needing to be called according to the GPU hardware resource table, wherein the performance resource value of the prepared GPU hardware is greater than or equal to the total required prepared resource value.

The prepared GPU hardware differs from the prepared resource value described above: it does not mean reserving part of the performance resources within each GPU hardware. Instead, the prepared GPU hardware is kept started so that a newly injected task does not force the GPU hardware that receives it into overload, which would degrade task execution.

In some embodiments of the present application, in order to determine the GPU hardware that needs to be started, a method for applying the GPU hardware resource table is disclosed, and the method for applying the GPU hardware resource table includes:

and determining a plurality of GPU hardware as GPU hardware needing to be started according to the total needed GPU resource value and the GPU hardware resource table, so that the total performance resource value of the GPU hardware needing to be started is larger than the total needed GPU resource value, and the total power consumption value of the GPU hardware is the lowest.
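For a small cluster, the "lowest total power" condition can be checked directly by trying every subset of the table. The sketch below is an illustrative exhaustive search, not a procedure taken from the text; it accepts a subset when its total performance is at least the demand, and the names and numbers are hypothetical. The search is exponential in the number of GPUs, so it only serves as a reference for small tables.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GpuEntry:
    position: str
    performance: int
    power: int


def min_power_subset(table: List[GpuEntry],
                     total_required: int) -> Optional[Tuple[int, List[str]]]:
    """Return (total power, positions) of the cheapest subset covering the demand."""
    best: Optional[Tuple[int, List[str]]] = None
    for size in range(1, len(table) + 1):
        for subset in combinations(table, size):
            if sum(g.performance for g in subset) < total_required:
                continue
            power = sum(g.power for g in subset)
            if best is None or power < best[0]:
                best = (power, [g.position for g in subset])
    return best


table = [GpuEntry("g0", 100, 250), GpuEntry("g1", 60, 120),
         GpuEntry("g2", 40, 160), GpuEntry("g3", 80, 240)]
best = min_power_subset(table, total_required=100)
```

With these made-up numbers the cheapest covering subset is g0 alone at a total power of 250, whereas calling hardware in descending order of h = a/b would take g1 and then g0 for a total of 370. Ordering by h is therefore a heuristic that does not always reach the minimum.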

In some embodiments of the present application, the method for applying the GPU hardware resource table further comprises the following steps:

Firstly, establishing a temporary GPU hardware call table, filling position information of GPU hardware into the temporary GPU hardware call table successively according to the total needed GPU resource value, and successively calculating the total performance resource value of the GPU hardware in the temporary GPU hardware call table.

Secondly, if the total performance resource value of the GPU hardware in the temporary GPU hardware call table is larger than or equal to the total needed GPU resource value, determining the GPU hardware corresponding to the position information recorded in the temporary GPU hardware call table as the GPU hardware needing to be started.

In some embodiments of the present application, in order to minimize the total energy consumption value of the GPU hardware that needs to be started, the method for applying the GPU hardware resource table further includes:

Firstly, setting the performance resource value of the GPU hardware as a, setting the energy consumption value as b, setting the performance energy consumption ratio of the GPU hardware as h, and calculating the performance energy consumption ratio as h = a/b.

Secondly, when the temporary GPU hardware call table is established, calling GPU hardware sequentially in descending order of the performance energy consumption ratio h.

According to the complete machine resource balance management method based on the electric energy loss, the actually needed GPU resources are determined by establishing the relation between tasks and GPU resource consumption, GPU hardware needing to be started is determined according to the actually needed GPU resources, all the tasks are migrated to the GPU hardware needing to be started, and then the GPU hardware without task execution is closed, so that the power consumption of a GPU cluster host is reduced to the maximum extent.

To further illustrate the technical solution of the present application, a specific application scenario is now disclosed to explain the technical solution of the present application.

In order to implement the tasks of rendering and arithmetic operations of large-scale graphics, it is necessary to establish a GPU cluster, which is a computer cluster in which each node is equipped with a Graphics Processing Unit (GPU). Taking advantage of the computational power of modern GPUs through general-purpose computation on a graphics processing unit (GPGPU), very fast computations can be performed using a cluster of GPUs.

When a GPU cluster executes tasks, it can be in a running state with a low task load in which many GPU hardware units still consume electric energy. To solve this problem, a complete machine resource balance management system based on electric energy loss is needed; the management system comprises an analysis unit, a resource occupation acquisition unit and a power supply control unit.

Referring to fig. 2, the method for controlling the GPU cluster by using the analysis unit, the resource occupation acquisition unit, and the power control unit includes the steps of:

Step one, obtaining a GPU resource consumption curve of each task.

Step two, determining the GPU resource consumption upper limit value.

Step three, determining the prepared resource value of the task.

Step four, determining the GPU resource theoretical total occupation value.

Step five, determining the total needed GPU resource value.

Step six, determining the GPU hardware needing to be started.

Step seven, migrating the tasks to the GPU hardware needing to be started.

Step eight, shutting down the GPU hardware on which no task is running.

The GPU resource consumption curve of each task is generated as follows: the resource occupation acquisition unit obtains the GPU performance resource occupancy rate, the occupancy rate is evaluated against the GPU hardware performance resource value determined by the analysis unit, and the GPU resource consumption curve of each task is then generated from the result.
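A minimal sketch of this conversion, assuming the occupancy rate is sampled as a fraction between 0 and 1 and that the curve is simply each sample scaled by the performance resource value of the GPU hardware running the task. The function name and the sample values are hypothetical.

```python
from typing import List, Sequence


def consumption_curve(occupancy_rates: Sequence[float],
                      gpu_performance_value: float) -> List[float]:
    """Scale sampled occupancy rates (0..1) by the GPU's performance resource value."""
    for rate in occupancy_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"occupancy rate out of range: {rate}")
    return [rate * gpu_performance_value for rate in occupancy_rates]


# Hypothetical samples for one task on a GPU with performance resource value 80.
curve = consumption_curve([0.25, 0.5, 0.75], gpu_performance_value=80.0)
upper_limit = max(curve)  # GPU resource consumption upper limit value
```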

The analysis unit is used for generating a GPU resource consumption curve according to the GPU performance resource occupation value, analyzing and determining a GPU resource consumption upper limit value of a task in a preset time period, determining a prepared resource value of the task based on a preset proportion of the GPU resource consumption upper limit value, determining a GPU resource theoretical total occupation value of the task by combining the GPU resource consumption upper limit value, determining a total needed GPU resource value according to the GPU resource theoretical total occupation value of each task, determining the GPU hardware needing to be started according to a preset GPU hardware resource table, migrating all tasks to the GPU hardware needing to be started, and determining the GPU hardware that runs no task and needs to be shut down.

According to the method and the system for the complete machine resource balanced management based on the electric energy loss, the GPU resources which are actually needed are determined by establishing the relation between the tasks and the GPU resource consumption, the GPU hardware which needs to be started is determined according to the relation, all the tasks are migrated to the GPU hardware which needs to be started, and then the GPU hardware which does not execute the tasks is closed, so that the power consumption of a GPU cluster host is reduced to the maximum extent.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the invention without departing from the spirit and scope of the invention.

Claims

1. A complete machine resource balance management method based on electric energy loss is characterized by comprising the following steps:

2. The overall machine resource balance management method based on the electric energy loss according to claim 1, wherein the content of the GPU hardware resource table comprises:

and the plurality of GPU hardware are correspondingly set with position information, performance resource values and energy consumption values aiming at each GPU hardware.

3. The overall resource balance management method based on the electric energy loss as claimed in claim 2, wherein the overall resource balance management method further comprises:

4. The power consumption-based complete machine resource balance management method according to claim 3, wherein the method for determining the prepared GPU hardware comprises the following steps:

and determining a total required prepared resource value according to a preset proportion of the total required GPU resource value, and determining prepared GPU hardware needing to be called according to the GPU hardware resource table, wherein the performance resource value of the prepared GPU hardware is greater than or equal to the total required prepared resource value.

5. The power consumption-based complete machine resource balance management method according to claim 2, wherein the method for presetting the GPU hardware resource table comprises the following steps:

6. The power consumption-based complete machine resource balance management method according to claim 2, wherein the method for applying the GPU hardware resource table comprises the following steps:

7. The power consumption-based complete machine resource balance management method according to claim 6, wherein the method for applying the GPU hardware resource table further comprises the following steps:

establishing a temporary GPU hardware call table, filling position information of GPU hardware in the temporary GPU hardware call table successively according to the total needed GPU resource value, and calculating the total performance resource value of the GPU hardware in the temporary GPU hardware call table successively;

and if the total performance resource value of the GPU hardware in the temporary GPU hardware call table is larger than or equal to the total needed GPU resource value, determining the GPU hardware corresponding to the position information of the GPU hardware recorded in the temporary GPU hardware call table as the GPU hardware needing to be started.

8. The power consumption-based complete machine resource balance management method according to claim 7, wherein the method for applying the GPU hardware resource table further comprises the following steps:

setting the performance resource value of the GPU hardware as a, setting the energy consumption value as b, setting the performance energy consumption ratio of the GPU hardware as h, and obtaining the performance energy consumption ratio h by the calculation method of h = a/b;

and when the temporary GPU hardware call table is established, sequentially calling GPU hardware according to the relation that the performance energy consumption ratio h is decreased from large to small.

9. A complete machine resource balance management system based on electric energy loss is characterized by comprising: