CN113434034B

CN113434034B - Large-scale cluster energy-saving method for adjusting CPU frequency of calculation task by utilizing deep learning

Info

Publication number: CN113434034B
Application number: CN202110774208.9A
Authority: CN
Inventors: 苏斌
Original assignee: Beijing Huaheng Shengshi Technology Co ltd
Current assignee: Beijing Huaheng Shengshi Technology Co ltd
Priority date: 2021-07-08
Filing date: 2021-07-08
Publication date: 2023-04-18
Anticipated expiration: 2041-07-08
Also published as: CN113434034A

Abstract

The invention discloses a large-scale cluster energy-saving method for adjusting the CPU frequency of a computing task by utilizing deep learning. After the critical value is obtained, the CPU frequency of the computing node running the computing task is adjusted, so that the running efficiency of the computing task and the energy consumption of the machine reach a balanced state.

Description

Large-scale cluster energy-saving method for adjusting CPU frequency of calculation task by utilizing deep learning

Technical Field

The invention relates to the technical field of deep learning, and in particular to a large-scale cluster energy-saving method that uses deep learning to adjust the CPU frequency at which computing tasks run.

Background

At present, the CPU frequency of the computing nodes in a large cluster is fixed, and different computing tasks all run at the same CPU frequency, so the power consumption of a supercomputing center stays at a consistently high level. For some computing tasks the CPU frequency is set from experience, which neither improves task performance effectively nor avoids wasting resources.

Running different computing tasks at the same CPU frequency helps neither task performance nor power saving in a large cluster: a single frequency may make a job run inefficiently or may increase the machine's energy consumption. In the prior art, frequency and energy consumption are difficult to balance. Even if the CPU frequency critical value for a machine running a given computing task can be worked out by running a large number of computing jobs, manually adjusting the machine's frequency is very cumbersome.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a large-scale cluster energy-saving method for adjusting the CPU frequency of a calculation task by utilizing deep learning.

In order to achieve the purpose, the invention adopts the following technical scheme:

the large-scale cluster energy-saving method for adjusting the CPU frequency of the calculation task by utilizing deep learning comprises the following specific processes:

after receiving job information submitted by a user, dispatching the job to a computing node which is most suitable for running a computing task of the job according to the collected load condition of each computing node;

when the computing node runs the computing task for the first time, the CPU frequency of the computing node is set to the node's current CPU frequency; while the computing task runs, job data and computing node data are collected once every set interval, where the job data comprise the job run time and the computing node data comprise the node's energy consumption and CPU frequency; a deep learning algorithm analyzes the job data and the computing node data to obtain a CPU frequency critical value, and the CPU frequency of the computing node is then adjusted to that critical value, which lowers the CPU frequency and saves energy on the computing node;

the specific process for analyzing the CPU frequency critical value by using the deep learning algorithm comprises the following steps:

a neural network model is constructed whose input variables comprise the job run time, the computing node energy consumption and the computing node CPU frequency; the weight value of each of the three input variables is determined; the collected job run times, node energy consumption values and CPU frequencies are used as the training data set, and the network outputs the CPU frequency critical value H of the computing node; the CPU frequency of the computing node is adjusted to the critical value, the critical value is repeatedly verified, and if the critical value changes the CPU frequency of the computing node is adjusted again.

Furthermore, each computing node has its own neural network, and the weight values of the job run time, the computing node energy consumption and the computing node CPU frequency in each node's neural network are determined according to actual running conditions.

The invention has the beneficial effects that: the invention forms a training set by acquiring the energy consumption of the machine running the calculation task under different frequencies, and analyzes the critical value of the relation between the CPU frequency and the energy consumption of the machine when the calculation task is run by using a deep learning algorithm. After the critical value is obtained, the CPU frequency of the computing node running the computing task is adjusted, so that the running efficiency of the computing task and the energy consumption of the machine reach a balanced state.

Drawings

FIG. 1 is a schematic flow chart of the method of Example 1 of the present invention;

FIG. 2 is a plot of energy consumption as a function of CPU frequency, drawn in Example 1 of the present invention;

FIG. 3 is a schematic diagram of the neural network model in Example 1 of the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings. The present embodiment is based on the above technical scheme and gives a detailed implementation and a specific operating process, but the protection scope of the present invention is not limited to this embodiment.

Example 1

The embodiment provides a large cluster energy saving method for adjusting the CPU frequency of a computation task by using deep learning, as shown in fig. 1, the specific process is as follows:

after receiving job information submitted by a user, dispatching the job to a computing node which is most suitable for running a computing task of the job according to the collected load condition of each computing node (server);

when the computing node runs the computing task for the first time, the CPU frequency of the computing node is set to the node's current frequency value; while the computing task runs, job data and computing node data are collected once every set interval, where the job data comprise the job run time and the computing node data comprise the node's energy consumption and CPU frequency; a deep learning algorithm analyzes the job data and the computing node data to obtain a CPU frequency critical value, and the CPU frequency of the computing node is then adjusted to that critical value, which lowers the CPU frequency and saves energy on the computing node.
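The dispatch and sampling steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `NodeLoad`, `Sample`, `pick_node` and `collect` are hypothetical, "most suitable" is assumed to mean "lowest reported load", and the sampling interval is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    host: str
    load: float  # hypothetical load metric, e.g. fraction of busy cores


@dataclass
class Sample:
    job_time: float      # job run time at sampling time
    energy: float        # computing node energy consumption reading
    cpu_freq_mhz: float  # computing node CPU frequency at sampling time


def pick_node(loads):
    """Return the host with the lowest reported load (assumed selection rule)."""
    if not loads:
        raise ValueError("no computing nodes available")
    return min(loads, key=lambda n: n.load).host


def collect(read_sample, n_samples):
    """Call read_sample() n_samples times and return the readings.

    A real collector would wait for the configured interval between calls.
    """
    return [read_sample() for _ in range(n_samples)]
```

Each `Sample` carries exactly the three quantities the method later feeds to the neural network.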

The principles and processes for deriving the CPU frequency threshold using deep learning algorithm analysis are further described below.

The formula for calculating the energy consumption and the CPU frequency of the node is as follows:

$$P = C V^{2} f$$

P represents energy consumption; C is a constant determined by factors such as the manufacturing process and design of the computing node; V represents the voltage; f represents the CPU frequency.
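As a quick check of how the formula scales, the sketch below evaluates it for a few inputs. The constant, voltages and frequencies are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(c, voltage, freq):
    """Evaluate P = C * V^2 * f."""
    return c * voltage ** 2 * freq


# Hypothetical values: C = 1e-9, V in volts, f in Hz.
base = dynamic_power(1e-9, 1.0, 1.5e9)
double_f = dynamic_power(1e-9, 1.0, 3.0e9)   # same voltage, twice the frequency
higher_v = dynamic_power(1e-9, 1.2, 1.5e9)   # same frequency, 20% more voltage

print(double_f / base)  # doubling f doubles P
print(higher_v / base)  # raising V by 20% raises P by about 44%
```

At fixed voltage P grows in proportion to f, while any voltage increase enters squared.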

The same computing task consumes different amounts of energy at different computing node CPU frequencies, and deep learning on training data can identify the CPU frequency at which the task's energy consumption is lowest. A training data set is formed from the CPU frequencies and energy consumption values recorded while CPUs with different core counts process the same computing task. The data are reduced to the median energy consumption at each CPU frequency, and energy consumption is plotted as a function of CPU frequency; the result is shown in FIG. 2.

In FIG. 2, the ordinate is the CPU energy consumption P in W and the abscissa is the CPU frequency f in MHz. As FIG. 2 shows, the relationship between P and f changes once the CPU frequency f reaches a critical value (red line):

(1) When f is less than or equal to the critical value, P and f are linearly related: the higher the CPU frequency f, the higher the energy consumption P and the higher the execution efficiency of the computing task;
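The reduction of raw samples to one median energy value per frequency could look like the sketch below. The function name and the `(cpu_freq, energy)` tuple layout are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def median_energy_by_freq(samples):
    """Group (cpu_freq, energy) pairs by frequency and take the median energy.

    Returns a dict ordered by ascending frequency, ready for plotting.
    """
    groups = defaultdict(list)
    for freq, energy in samples:
        groups[freq].append(energy)
    return {freq: median(values) for freq, values in sorted(groups.items())}
```

Using the median rather than the mean keeps a single outlying reading from shifting the curve.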

(2) When f is greater than the critical value, P and f lose their original linear relationship and show an exponential relationship. In the formula $P = C V^{2} f$, a higher f requires a higher CPU voltage V, so the $V^{2}$ term accounts for a growing share of P. Each further increase of the CPU frequency by the same Δf therefore requires a larger ΔP than the one before, and the curve takes on the shape of an exponential function: CPU energy consumption rises rapidly with every additional Δf.
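The two regimes can be reproduced with a toy model. Everything numeric here is an assumption: the voltage curve (flat up to a critical frequency, then rising linearly), the constant `c`, and the 1650 MHz critical value are hypothetical and not taken from the patent. Note also that this particular model grows polynomially above the critical value; it illustrates the growing ΔP, not a strictly exponential law.

```python
def voltage(freq_mhz, f_crit=1650.0, v0=1.0, slope=0.002):
    """Hypothetical voltage curve: constant up to f_crit, then rising linearly."""
    return v0 + slope * max(0.0, freq_mhz - f_crit)


def power(freq_mhz, c=3.5e-4):
    """P = C * V(f)^2 * f with the hypothetical voltage curve above."""
    return c * voltage(freq_mhz) ** 2 * freq_mhz


def delta_p(freq_mhz, step=100.0):
    """Extra power needed to raise the frequency by one step."""
    return power(freq_mhz + step) - power(freq_mhz)


for f in (1300.0, 1500.0, 1650.0, 1750.0):
    print(f, round(delta_p(f), 4))
```

Below the critical value every 100 MHz step costs the same ΔP; above it each step costs more than the last.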

Therefore, in this embodiment, the specific process of obtaining the CPU frequency threshold value by using the deep learning algorithm includes:

Raising the CPU frequency speeds up the execution of a computing task, but it can also push the computing node into an overclocked state in which the system is unstable and energy consumption rises rapidly. To meet the energy-saving requirements of green computing, the most appropriate CPU frequency must be found, one at which the energy consumption of the computing node is relatively low and the computing task still executes quickly.

After the same computing task has been run many times on the computing nodes, the CPU frequencies and the electrical energy consumed by the computing nodes during those runs form a training set. The training set is used to calculate the critical point of P and f; beyond that point the energy consumption of the computing node increases sharply.

At the same time, energy saving in the server cluster must not conflict with the execution of computing tasks. Considering energy saving alone would harm task execution efficiency, so the execution time of the computing task must also be taken into account when deriving the CPU frequency critical value. The method of this embodiment therefore builds the neural network model shown in FIG. 3.
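To make the idea of a critical point concrete, the sketch below locates it with a plain slope-ratio heuristic. This is a swapped-in technique for illustration only, not the deep learning analysis the patent describes; the function name and the `ratio` threshold are hypothetical. The sample points are the (cpuf, P) pairs reported in Tables 1 to 4 of Example 2.

```python
def find_critical_freq(points, ratio=2.0):
    """Return the frequency after which the P-f slope jumps by more than `ratio`.

    points: list of (freq, energy) pairs sorted by ascending freq.
    Returns None when no such jump exists.
    """
    slopes = [
        (p1 - p0) / (f1 - f0)
        for (f0, p0), (f1, p1) in zip(points, points[1:])
    ]
    for i in range(1, len(slopes)):
        if slopes[i - 1] > 0 and slopes[i] > ratio * slopes[i - 1]:
            return points[i][0]
    return None


# (cpuf, P) pairs from Tables 1 to 4 of Example 2.
table_points = [(1300, 0.45), (1500, 0.51), (1650, 0.59), (1750, 0.85)]
print(find_critical_freq(table_points))
```

On the tabulated values the slope between 1650 and 1750 is several times the slope before it, so the heuristic reports 1650.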

The input variables of the neural network comprise the job run time X1, the computing node energy consumption X2 and the computing node CPU frequency X3. The weights of the three input variables differ between computing nodes and computing tasks, so the weight values for each computing node are determined from actual running conditions. The network outputs the CPU frequency critical value H of the computing node. The CPU frequency of the computing node is adjusted to the critical value, the critical value is repeatedly verified, and the CPU frequency is adjusted again whenever the critical value changes.

Because the critical CPU frequency of the computing node is found by weighing all of these factors, keeping the CPU frequency at the critical value while the computing task runs balances operating efficiency against energy consumption.
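The mapping from three weighted inputs to one output H can be illustrated with the simplest possible stand-in, a single linear unit trained by gradient descent on squared error. This is an assumption made for illustration: the patent does not give the network's layers at this point, and the function names, learning rate and example values are hypothetical.

```python
def predict_critical_freq(x, weights, bias=0.0):
    """Weighted sum of x = (job_time, node_energy, cpu_freq) plus a bias."""
    if len(x) != len(weights):
        raise ValueError("x and weights must have the same length")
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def sgd_step(x, target, weights, bias, lr):
    """One gradient-descent update for the loss 0.5 * (prediction - target)^2."""
    err = predict_critical_freq(x, weights, bias) - target
    new_weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return new_weights, bias - lr * err
```

Because each node keeps its own weights, the same three readings can map to different critical values on different machines. In practice the inputs would need scaling before training, since run time, energy and frequency differ by orders of magnitude.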

It should be noted that the two-dimensional convolutional neural network of the present embodiment is formulated as

$x_{i,j}$ are the input variables, $w_{u-i,v-j}$ are the weights, and $H(u,v)$ is the output of layer H; input matrix

Weight value matrix

Affine transformation is carried out according to a forward propagation formula of the convolutional neural network to obtain

Wherein, the feature vector of each layer before activation is z, and the feature vector after activation is y, that is, y = f (z); the input x of each layer can be regarded as a feature vector y of the previous layer after activation; the loss function is denoted by j:

the convolution kernel size is n x n, so the effective convolution is defined as

Wherein, $w_{rot}$ is the matrix w rotated by 180°.
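A valid convolution with a 180° rotated kernel can be sketched in plain Python as below. This follows the standard textbook definition and is an assumption about what the omitted formula expresses; the function names are hypothetical.

```python
def rot180(w):
    """Rotate a 2-D kernel by 180 degrees (reverse rows, then each row)."""
    return [row[::-1] for row in w[::-1]]


def valid_convolve2d(x, w):
    """Valid 2-D convolution of x with an n x n kernel w.

    The rotated kernel is slid over every position where it fits entirely
    inside x, so the output is (rows - n + 1) x (cols - n + 1).
    """
    n = len(w)
    w_rot = rot180(w)
    rows, cols = len(x) - n + 1, len(x[0]) - n + 1
    return [
        [
            sum(x[u + a][v + b] * w_rot[a][b] for a in range(n) for b in range(n))
            for v in range(cols)
        ]
        for u in range(rows)
    ]
```

With a 3 x 3 input and a 2 x 2 kernel the output is 2 x 2, because the kernel is never allowed to overhang the input.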

Therefore, the h-th layer output formula of the convolutional neural network is as follows:

$b^{h}$ is the error-loss term, which is negligible in the calculation.

In this embodiment, the input matrix and the output matrix obtained when a certain computation task runs are substituted into a formula to obtain

The resulting coefficients vary with the type and duration of each job and with the machine load. The coefficients calculated while jobs of the same type run are collected into a data set, and the median of that data set is used to adjust the system's CPU frequency.

It should be noted that each computing node has its own deep learning framework, so different CPU frequency critical values can be calculated for different machine models and computing task workloads. Adjusting a computing node's CPU frequency to its critical value saves as much of the node's energy as possible while preserving the operating efficiency of the computing task.
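The patent does not say how the frequency is actually applied on a node. One common mechanism on Linux, shown here purely as an assumption, is to cap each logical CPU through the cpufreq sysfs file `scaling_max_freq`, which takes a value in kHz and normally requires root. The helper name and the `sysfs_root` parameter are hypothetical; the parameter exists so the function can be exercised against a scratch directory.

```python
from pathlib import Path


def set_max_freq_mhz(cpu, freq_mhz, sysfs_root="/sys/devices/system/cpu"):
    """Cap one logical CPU at freq_mhz by writing scaling_max_freq (in kHz).

    Returns the kHz value written. Writing to the real sysfs tree needs root.
    """
    khz = int(round(freq_mhz * 1000))
    target = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq" / "scaling_max_freq"
    target.write_text(f"{khz}\n")
    return khz
```

A node-side agent would call this once per logical CPU whenever the critical value changes.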

Example 2

The method of Example 1 may be combined with a computing task management system. The computing task management system can acquire the energy consumption data of the relevant computing nodes and can retrieve real-time data through a command, as shown in Table 1.

TABLE 1

host	cpuf	P	job_name	ave_job_time
quickpool-1	1300	0.45	test1	105

In Table 1, the displayed information comprises the node name, the node CPU frequency, the node energy consumption P, the name of the job currently running in batch on that computing node, and the average job execution time. The job's computing task starts running at the current CPU frequency; the CPU frequency is then raised step by step at a fixed time interval, which increases both the node's energy consumption and the job's execution efficiency and shortens the average job time. The resulting changes are shown in Tables 2, 3 and 4.

TABLE 2

host	cpuf	P	job_name	ave_job_time
quickpool-1	1500	0.51	test1	99

TABLE 3

host	cpuf	P	job_name	ave_job_time
quickpool-1	1650	0.59	test1	89

TABLE 4

host	cpuf	P	job_name	ave_job_time
quickpool-1	1750	0.85	test1	84

When the CPU frequency reaches 1750 MHz, the growth in energy consumption changes abruptly: it had been rising slowly, by no more than 0.08 W per step in Tables 1 to 3, but it jumps by 0.26 W over the 100 MHz step from 1650 MHz to 1750 MHz. This matches the turning point of the energy consumption curve described above, so the CPU frequency is then gradually lowered, as shown in Tables 6 and 7.

TABLE 6

TABLE 7

host	cpuf	P	job_name	ave_job_time
quickpool-1	1640	0.59	test1	89

When the CPU frequency stabilizes at about 1600-1650 MHz, the machine's energy consumption and the job's execution efficiency are in dynamic balance. The method of Example 1 can thus be integrated with the computing task management system to provide unified control of cluster management and energy saving.
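The trade-off in the tables can be checked with a little arithmetic. The sketch below normalizes each step in Tables 1 to 4 to a 100-unit frequency increase and reports the extra energy and the job time saved; the function name is hypothetical and the rows are copied from the tables.

```python
# (cpuf, P, ave_job_time) rows from Tables 1 to 4.
rows = [(1300, 0.45, 105), (1500, 0.51, 99), (1650, 0.59, 89), (1750, 0.85, 84)]


def marginal_costs(rows):
    """Per consecutive pair of rows, return
    (target frequency, extra energy per 100 units, time saved per 100 units)."""
    out = []
    for (f0, p0, t0), (f1, p1, t1) in zip(rows, rows[1:]):
        scale = 100.0 / (f1 - f0)
        out.append((f1, (p1 - p0) * scale, (t0 - t1) * scale))
    return out


for freq, extra_energy, time_saved in marginal_costs(rows):
    print(freq, round(extra_energy, 3), round(time_saved, 2))
```

The last step, to 1750, costs several times more energy per 100 units than the step before it while saving less job time, which is why the frequency is brought back down toward 1650.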

Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims

1. A large-scale cluster energy-saving method based on a deep learning algorithm is characterized by comprising the following steps:

when the computing node runs the computing task for the first time, adjusting the CPU frequency of the computing node to be the current CPU frequency of the computing node;

in the running process of the computing task, collecting operation data and computing node operation data at set intervals, wherein the operation data comprises operation running time, and the computing node operation data comprises computing node energy consumption and computing node CPU frequency;

constructing a neural network model based on a deep learning algorithm, taking operation running time, computing node energy consumption and computing node CPU frequency as input variables of the neural network model, wherein the three input variables have corresponding input weight values, taking the operation running time, the computing node energy consumption and the computing node CPU frequency of different nodes in the same computing task as a data training set, and training the neural network model to obtain a trained neural network;

inputting the operation running time, the energy consumption of the computing nodes, the CPU frequency of the computing nodes and the corresponding weight values of the three obtained in the actual computing task into a trained neural network, and outputting the CPU frequency critical value of the computing nodes by the trained neural network; when the CPU frequency of the computing node is less than or equal to the CPU frequency critical value, the energy consumption of the computing node and the CPU frequency of the computing node are in a linear relation; when the CPU frequency of the computing node is greater than the CPU frequency critical value, the energy consumption of the computing node and the CPU frequency of the computing node are in an exponential relation;

adjusting the CPU frequency of the computing node to the CPU frequency critical value, repeatedly verifying whether the CPU frequency critical value is correct, and if the CPU frequency critical value changes, adjusting the CPU frequency of the computing node again;

wherein each of the compute nodes has its own neural network.