CN111601490B - Reinforcement learning control method for data center active ventilation floor - Google Patents

Reinforcement learning control method for data center active ventilation floor

Info

Publication number: CN111601490B
Application number: CN202010456237.6A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN111601490A
Prior art keywords: rack, time, value, active ventilation, ventilation floor
Inventors: 万剑雄, 周杰, 熊伟
Original and current assignee: Inner Mongolia University of Technology
Application filed by Inner Mongolia University of Technology; published as CN111601490A; application granted and published as CN111601490B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H05: ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR
    • H05K: PRINTED CIRCUITS; CASINGS OR CONSTRUCTIONAL DETAILS OF ELECTRIC APPARATUS; MANUFACTURE OF ASSEMBLAGES OF ELECTRICAL COMPONENTS
    • H05K7/00: Constructional details common to different types of electric apparatus
    • H05K7/20: Modifications to facilitate cooling, ventilating, or heating
    • H05K7/20709: Modifications to facilitate cooling, ventilating, or heating for server racks or cabinets; for data centers, e.g. 19-inch computer racks
    • H05K7/20836: Thermal management, e.g. server temperature control

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Thermal Sciences (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Air Conditioning Control Device (AREA)

Abstract

A reinforcement learning control method for the active ventilation floor of a data center. A Markov decision process model is established for the rack hot-spot problem of a raised-floor-structure data center, and a reinforcement learning model solving algorithm is provided as the core of the reinforcement learning control algorithm. Without increasing the machine-room air-conditioning power, the method intelligently controls the fan speed of the active ventilation floor (an ordinary ventilation floor with a fan attached to its underside) according to the current rack temperature distribution, homogenizing the rack-inlet temperature distribution by actively delivering a sufficient amount of cold air. This alleviates the rack hot-spot problem common in raised-floor data centers, saves refrigeration energy, and ensures server safety and stability. Compared with existing rack-level airflow management methods for data centers, the method is easier to deploy, more cost-effective, and more universally applicable.

Description

Reinforcement learning control method for data center active ventilation floor
Technical Field
The invention belongs to the technical field of automatic control, and particularly relates to a reinforcement learning control method for an active ventilation floor of a data center.
Background
A rack hot spot is a location on a data center machine-room rack whose temperature is significantly higher than that of other locations. Excessive temperatures lower the operating efficiency of the affected servers, reducing the data center's overall power density as well as its reliability, which clearly runs counter to data center requirements.
Alleviating or eliminating rack hot spots by global regulation, for example raising the machine-room air-conditioning power to supply abundant cold air, inevitably leaves most rack areas over-cooled. Besides wasting cooling resources, this further inflates a cooling bill that already accounts for nearly half of a data center's total energy consumption. Rack-level cooling solutions are therefore better suited to mitigating the rack hot-spot problem.
Rack-level refrigeration solutions already exist, such as installing adaptive ventilation floors, installing baffles, or enclosing individual racks and providing them with dedicated ventilation ducts. These, however, are "passive" cooling solutions: they do not actively deliver a cooling airflow to the racks, and they fall short when the cold-air supply is insufficient.
The active ventilation floor is another rack-level refrigeration scheme that relieves the rack hot-spot problem by actively delivering cold air; compared with the schemes above, it is easier to deploy and more cost-effective. Its main control difficulty lies in the diversity and dynamics of the deployment environment: machine-room air conditioners, racks, and the servers within racks are laid out differently; cold and hot aisles are contained to different degrees; server racks follow different standards and sealing conditions; and machine-room air-conditioning power and the thermal load of individual rack servers vary. As a result, the thermal efficiency and airflow behavior of a data center are generally difficult to capture with an analytic model.
Most existing research on active ventilation floors concerns performance modeling and evaluation based on measurement or simulation; to date there is no published research on the active ventilation floor control problem.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a reinforcement learning control method for the active ventilation floor of a data center that automatically learns an optimal operating strategy and plans rack airflow without increasing the machine-room air-conditioning power, so that the rack temperature distribution becomes uniform and the rack hot-spot problem is alleviated. No complex airflow and heat-exchange models need to be established or calibrated, which improves the universality of the active ventilation floor.
In order to achieve the purpose, the invention adopts the technical scheme that:
a reinforcement learning control method for an active ventilation floor of a data center is used for establishing a Markov decision process model for a rack hotspot problem of a lifting floor structure data center and providing a reinforcement learning model solving algorithm and an array algorithm as the core of a reinforcement learning control algorithm. The model consists of four parts, namely a system state, a behavior, an incentive and a value function, the solution of the model is that the optimal behavior is continuously selected under a series of system states to maximize the accumulated incentive of the system, the reinforcement learning control algorithm utilizes whether the temperature distribution of the air inlet of the rack is uniform and whether the energy consumption of the active ventilation floor is low as evaluation standards, and adjusts the rotating speed of the fan of the active ventilation floor by continuously exploring and learning the complex relation between the duty ratio value of the PWM signal and the rising, the lowering or the constant maintaining of the value, so that the temperature distribution of the air inlet of the rack is uniform, and the hot spot problem of the rack is relieved.
Compared with the prior art, the invention has the beneficial effects that:
The invention requires no complex airflow and heat-exchange models to be established or calibrated. Using an array control algorithm, it copes with the diversity and dynamics of the active ventilation floor's deployment environment, automatically learning the relationship between the PWM duty-cycle value and raising, lowering, or maintaining it from whether the rack-inlet temperature distribution is uniform and from the floor's energy consumption. To operate the invention, one only needs to replace the original ordinary ventilation floor with an active ventilation floor.
Compared with an intelligent control method that combines three intelligent algorithms, the reinforcement learning control method using the array algorithm is simpler and requires less computing-resource overhead.
Conversely, compared with the array-algorithm reinforcement learning control method, the intelligent control method using three intelligent algorithms defines states and behaviors in a way that addresses the hot-spot problem more directly and effectively, and its non-discretized state definition and approximation of the Q function strengthen its universality.
Drawings
FIG. 1 is a diagram of the active ventilation floor design and deployment. In the figure, reference numeral 1 is a temperature sensor, 2 is a rack, 3 is a microcontroller, 4 is a drive board, 5 is a switching power supply, 6 is a PC, and 7 is an active ventilation floor.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
Fig. 1 is a schematic diagram of a detailed deployment of the invention: a number of first temperature sensors 1 are evenly distributed at the air inlet of a rack 2 to monitor the inlet temperature distribution of the rack 2, and a second temperature sensor is additionally installed below the active ventilation floor to monitor the supply-air temperature under the floor.
In the field, a rack 2 is a rectangular metal cabinet containing a number of servers, and multiple racks are placed in rows. Within a row, the left and right panels of a rack usually abut neighboring racks; the front panel is the air inlet that draws in cold air to cool the servers, and the rear panel is the air outlet that exhausts the heated air. Monitoring the rack-inlet temperature distribution means monitoring the temperature at certain positions on the front panel; these readings together form the inlet temperature distribution, so the number of first temperature sensors 1 depends on the number of monitored positions.
The active ventilation floor reinforcement learning control method runs on a PC: the PC 6 is connected to a microcontroller 3, the microcontroller 3 to a drive board 4, and the drive board 4, powered by a switching power supply 5 (12 V, 20 A), to the active ventilation floor fan 7. From the temperature distribution returned by the first temperature sensors 1, the PC generates a PWM-signal duty-cycle value and transmits it to the microcontroller 3; the microcontroller 3 generates the corresponding PWM signal and passes it to the drive board 4; the drive board 4 regulates, according to the PWM signal, the voltage that the switching power supply 5 delivers to the fan 7, so that controlling the fan's supply voltage adjusts its rotational speed.
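To make the control chain concrete, here is a minimal Python sketch of the PC side of this link. It is illustrative only: the patent does not specify a wire protocol, so the `FloorLink` class, the `DC:<value>` command string, and the comma-separated temperature reply are all invented assumptions (in practice the transport would be a serial port, e.g. via pyserial):

```python
class FloorLink:
    """PC side of the chain PC -> microcontroller -> drive board -> fan (hypothetical protocol)."""

    def __init__(self, transport):
        # `transport` is any object with write()/readline(), e.g. a pyserial Serial port.
        self.transport = transport

    def send_duty_cycle(self, duty_percent: float) -> None:
        # Send the PWM duty-cycle value; the microcontroller turns it into a PWM signal.
        self.transport.write(f"DC:{duty_percent:.1f}\n".encode())

    @staticmethod
    def parse_temperatures(reply: bytes) -> list[float]:
        # Parse a hypothetical comma-separated sensor reply, e.g. b"23.1,24.0,26.5".
        return [float(x) for x in reply.decode().strip().split(",")]
```

In deployment, `transport` would be an opened serial connection to the microcontroller 3; the command and reply formats here are stand-ins, not the patent's.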
The control method comprises the following steps:
1. A Markov decision process model is established for the rack hot-spot problem of the raised-floor structure (a data center air-supply structure in which the machine-room floor is elevated, leaving a 60-100 cm underfloor space through which the machine-room air conditioners deliver cold air; most domestic data centers currently use this structure). The model consists of the following four parts:
A. The system state s_t is defined as the discretized duty cycle of the PWM-signal square wave:

s_t ∈ 𝒮,  𝒮 = { s | s = k · D_TQ, k = 0, 1, …, max(DC)/D_TQ }

where s_t is the system state at time t, 𝒮 is the state space, s is an element of 𝒮, DC is the PWM square-wave duty-cycle value, max(DC) is the maximum of DC, D_TQ is the discretization step of DC, and k is the number of D_TQ steps making up a given state.
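As a concrete illustration of this discretization, consider the following sketch. The numeric values are assumptions (the patent does not fix max(DC) or D_TQ; 100% and 10% are chosen only for illustration):

```python
MAX_DC = 100.0   # assumed maximum PWM duty cycle max(DC), in percent
D_TQ = 10.0      # assumed discretization step of the duty cycle

# State space S: every multiple of D_TQ from 0 up to max(DC), i.e. s = k * D_TQ
STATES = [k * D_TQ for k in range(int(MAX_DC / D_TQ) + 1)]

def to_state(dc: float) -> float:
    """Map a raw duty-cycle value to the nearest discretized state s_t."""
    k = round(dc / D_TQ)
    return min(max(k, 0), int(MAX_DC / D_TQ)) * D_TQ
```

With these assumed values the state space has 11 elements: 0%, 10%, ..., 100%.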
B. The system behavior space 𝒜 is defined as the change in rotational speed of the active ventilation floor fan, i.e., 𝒜 = { +D_TQ, 0, −D_TQ }: each behavior raises, maintains, or lowers the PWM duty cycle (and hence the fan speed) by one quantization step.
C. The reward R_{t+1} combines a quantitative index of the uniformity of the rack-inlet temperature distribution with the energy consumption of the active ventilation floor fan:

R_{t+1} = − (1/|𝒩|) · Σ_{i∈𝒩} (T_{t,i} − T_t^ref)² − (A_ref · DC_t)³,  with T_t^ref = T_{t,under} + Δ_T

where R_{t+1} is the reward obtained after the system takes an action at time t. The first term quantifies the uniformity of the rack-inlet temperature distribution; its value is non-positive, and the closer it is to 0, the more uniform the distribution. T_{t,i} is the reading at time t of the first temperature sensor numbered i; T_t^ref is the rack reference temperature at time t; T_{t,under} is the reading of the second temperature sensor at time t; Δ_T is a fixed positive temperature offset set according to the degree of mixing of cold and hot air above the active ventilation floor; 𝒩 is the set of first temperature sensors and |𝒩| their total number. The second term, −(A_ref · DC_t)³, represents the fan energy consumption of the active ventilation floor; it is likewise non-positive, and the closer to 0, the lower the consumption. A_ref is a reference behavior value that keeps this term on the same order of magnitude as the uniformity term, and DC_t is the PWM square-wave duty cycle at time t.
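A minimal sketch of this reward, under stated assumptions: the uniformity term is read here as the mean squared deviation of the inlet readings from the reference temperature (the original image formula is not reproduced in the text), and the values of `A_REF` and `DELTA_T` are invented for illustration:

```python
A_REF = 0.02      # assumed reference scaling value A_ref for the energy term
DELTA_T = 2.0     # assumed fixed offset above the under-floor supply temperature

def reward(inlet_temps: list[float], t_under: float, dc_t: float) -> float:
    """R_{t+1}: uniformity term (non-positive) plus fan-energy term (non-positive)."""
    t_ref = t_under + DELTA_T                     # rack reference temperature T_t^ref
    uniformity = -sum((t - t_ref) ** 2 for t in inlet_temps) / len(inlet_temps)
    energy = -(A_REF * dc_t) ** 3                 # -(A_ref * DC_t)^3
    return uniformity + energy
```

Both terms are at most 0, so a reward near 0 means a uniform inlet distribution reached at low fan energy, which matches the evaluation criteria stated above.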
D. The cost function Q(s_t, a_t) is the behavior value function:

Q(s_t, a_t) = 𝔼[ Σ_{y=0}^{∞} γ^y · R_{t+y+1} | s_t, a_t ]

where the cost function Q(s, a) is referred to as the Q function; a_t ∈ 𝒜 is the action taken by the system at time t; 𝔼 is the expectation function; y is the future time relative to time t; R_{t+y+1} is the reward obtained after the system takes an action at time t + y; γ is the attenuation factor, 0 ≤ γ < 1, expressing how much the model values future rewards (environmental influence); and γ^y, the y-th power of γ, is the attenuation factor applied to R_{t+y+1}.
E. The Markov decision process model can be summarized as maximizing the cumulative reward by selecting the optimal behavior in the system state at every time t, with the model formula:

max 𝔼[ Σ_{t=0}^{∞} γ^t · R_{t+1} ]  subject to  s_t ∈ 𝒮, a_t ∈ 𝒜

where γ^t is the attenuation factor applied to the system's R_{t+1} at time t.
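The objective above, the expected discounted cumulative reward, can be illustrated for a finite reward trajectory (a sketch; γ = 0.9 is an assumed value, not specified in the patent):

```python
GAMMA = 0.9   # assumed attenuation (discount) factor, 0 <= gamma < 1

def discounted_return(rewards: list[float]) -> float:
    """Sum over t of gamma^t * R_{t+1} for a finite trajectory of rewards."""
    return sum(GAMMA ** t * r for t, r in enumerate(rewards))
```

Because γ < 1, rewards further in the future contribute less, which is what lets the infinite sum in the model remain bounded.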
2. Model solution and solving algorithm
a. Model solution: an optimal Q function is computed; in the system state at any time t, the optimal behavior is then selected according to the optimal Q function, maximizing the cumulative reward. The optimal Q function satisfies:

Q*(s_t, a_t) = 𝔼[ R_{t+1} + γ · max_{a∈𝒜} Q*(s_{t+1}, a) ]

At any time t, the optimal behavior is selected by:

a_t = argmax_{a∈𝒜} Q*(s_t, a)

where Q*(s_t, a_t) denotes the optimal Q function; s_{t+1} is the system state at time t + 1; a is any one of all the actions the system may take at time t + 1, i.e., any behavior in the action space 𝒜; and max_{a∈𝒜} Q*(s_{t+1}, a) is the largest optimal-Q value obtainable in state s_{t+1} over all actions a ∈ 𝒜.
b. Solving algorithm: compute the optimal Q function and select the optimal behavior at each decision so as to maximize the cumulative reward. The reinforcement learning model solving algorithm is an array (tabular) algorithm: the Q function is stored in a two-dimensional array whose row index is the state and whose column index is the behavior. The difference δ_{t+1} between the Q sample value Q_{t+1,target} and the Q query value Q_t(s_t, a_t) is used to iteratively update the Q values in the array and compute the optimal Q function; the optimal behavior is then selected by querying the array, maximizing the model's cumulative reward. The Q sample value is computed, following the optimal-Q formula, from the R_{t+1} and s_{t+1} obtained from the running system in real time; the Q query value is the value found in the two-dimensional array at the row and column given by the s_t and a_t observed in real time.

The Q sample value is calculated as:

Q_{t+1,target} = R_{t+1} + γ · max_{a∈𝒜} Q_t(s_{t+1}, a)

where max_{a∈𝒜} Q_t(s_{t+1}, a) is the largest Q query value in row s_{t+1} of the two-dimensional array at time t. The array is updated as:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + β(s_t, a_t) · δ_{t+1},  δ_{t+1} = Q_{t+1,target} − Q_t(s_t, a_t)

where Q_t(s_t, a_t) is the query value for s_t and a_t in the array at time t; Q_{t+1}(s_t, a_t) is that entry at time t + 1; and β(s_t, a_t) ∈ [0, 1] is the learning step size for each state-behavior pair in the array.
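The array update above is the classic tabular Q-learning rule. A minimal sketch follows; the table sizes, integer state/action encodings, and the γ and β values are illustrative assumptions:

```python
GAMMA = 0.9                                   # assumed attenuation factor

N_STATES, N_ACTIONS = 11, 3                   # e.g. 11 duty-cycle states x {lower, keep, raise}
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]      # two-dimensional array: row=state, col=behavior
beta = [[0.5] * N_ACTIONS for _ in range(N_STATES)]   # learning step size per state-behavior pair

def update(s: int, a: int, r: float, s_next: int) -> float:
    """One iteration: Q sample = r + gamma * max of row s_next; delta = sample - query."""
    q_target = r + GAMMA * max(Q[s_next])     # Q_{t+1,target}
    delta = q_target - Q[s][a]                # delta_{t+1}
    Q[s][a] += beta[s][a] * delta             # Q_{t+1}(s_t, a_t)
    return delta
```

Each call performs exactly one "query, sample, difference, update" step of the array algorithm described above.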
3. The model is solved with the reinforcement learning model solving algorithm: using whether the rack-inlet temperature distribution is uniform and whether the active-ventilation-floor energy consumption is low as evaluation criteria, the algorithm continuously explores and learns the complex relationship between the PWM duty cycle and raising, lowering, or maintaining it, adjusting the fan speed of the active ventilation floor so that the rack-inlet temperature distribution becomes uniform and the rack hot-spot problem is alleviated. The PC-side operating logic is:

1: set the reference temperature T_t^ref; initialize β(s_t, a_t); initialize the array;
2: set the initial time t = 0; set the exploration warm-up interval random_slots, the initial exploration probability ε, the exploration decay rate δ_ε per step t, and the minimum exploration probability ε_min;
3: select the initial state s_0 = max(DC);
4: loop begin
5: if t < random_slots, select a behavior at random from the behavior space and go to 7; otherwise go to 6;
6: update the exploration probability ε = max(ε − δ_ε, ε_min) and select a behavior by: a_t = argmax_{a∈𝒜} Q_t(s_t, a) with probability 1 − ε, or a random a ∈ 𝒜 with probability ε;
7: execute a_t (the PC sends the duty-cycle command to the microcontroller), obtain the next system state s_{t+1} (the PC sends a temperature-request command to obtain the rack temperature distribution), and compute R_{t+1} from the reward formula;
8: update the corresponding value in the array according to the update formula;
9: increase the time t by 1;
10: loop end.
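Putting steps 1-10 together, the PC-side loop might be simulated as below. This is a sketch under assumptions: the `execute` function is a toy stand-in for step 7 (in deployment it would send the duty-cycle command to the microcontroller and read back the rack temperature distribution), and all numeric parameters are invented:

```python
import random

N_STATES, N_ACTIONS = 11, 3                    # discretized duty cycles x {lower, keep, raise}
GAMMA, BETA = 0.9, 0.5                         # assumed discount factor and step size
EPS, EPS_DECAY, EPS_MIN = 1.0, 0.01, 0.05      # assumed exploration schedule
RANDOM_SLOTS = 50                              # pure-exploration warm-up interval (step 5)

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def execute(s: int, a: int) -> tuple[int, float]:
    """Toy stand-in for step 7: apply the action, observe next state and reward."""
    s_next = min(max(s + (a - 1), 0), N_STATES - 1)        # action 0/1/2 -> lower/keep/raise
    return s_next, -abs(s_next - 7) - 0.001 * s_next ** 3  # invented reward, peaks near state 7

s = N_STATES - 1                 # step 3: s_0 = max(DC)
eps = EPS
for t in range(500):             # steps 4-10
    if t < RANDOM_SLOTS:         # step 5: warm-up, pure exploration
        a = random.randrange(N_ACTIONS)
    else:                        # step 6: decaying epsilon-greedy selection
        eps = max(eps - EPS_DECAY, EPS_MIN)
        a = (random.randrange(N_ACTIONS) if random.random() < eps
             else max(range(N_ACTIONS), key=lambda x: Q[s][x]))
    s_next, r = execute(s, a)    # step 7
    Q[s][a] += BETA * (r + GAMMA * max(Q[s_next]) - Q[s][a])   # step 8: array update
    s = s_next                   # step 9
```

The structure mirrors the listing above; only `execute` and the constants would change in a real deployment.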
In summary, the invention establishes a Markov decision process model for the rack hot-spot problem of a raised-floor-structure data center and provides a reinforcement learning model solving algorithm as the core of a reinforcement learning control algorithm. Without increasing the machine-room air-conditioning power, it intelligently controls the fan speed of the active ventilation floor (an ordinary ventilation floor with a fan attached to its underside) according to the current rack temperature distribution, homogenizing the rack-inlet temperature distribution by actively delivering a sufficient amount of cold air. This alleviates the rack hot-spot problem common in raised-floor data centers, saving refrigeration energy and ensuring server safety and stability. Compared with existing rack-level airflow management methods for data centers, it is easier to deploy, more cost-effective, and more universally applicable.

Claims (3)

1. A reinforcement learning control method for a data center active ventilation floor, characterized by comprising the following steps:
step 1, arranging a number of first temperature sensors at the air inlet of a rack to monitor the rack-inlet temperature distribution, and arranging a second temperature sensor under the active ventilation floor to monitor the supply-air temperature under the floor;
step 2, establishing a Markov decision process model for the rack hot-spot problem of the raised-floor-structure data center, the model consisting of four parts: a system state s_t, a behavior space 𝒜, a reward R_{t+1}, and a cost function Q(s_t, a_t);

wherein: s_t is the system state at time t and 𝒮 is the state space, the state being defined as the discretized duty cycle of the PWM-signal square wave:

s_t ∈ 𝒮,  𝒮 = { s | s = k · D_TQ, k = 0, 1, …, max(DC)/D_TQ }

where s is an element of 𝒮, DC is the PWM square-wave duty-cycle value, max(DC) is the maximum of DC, D_TQ is the discretization step of DC, and k is the number of D_TQ steps making up a given state; the PWM square wave is generated as follows: a PWM-signal duty-cycle value is generated according to the temperature distribution returned by the first temperature sensors and transmitted to the microcontroller, and the microcontroller generates the corresponding PWM signal from that duty-cycle value;

the behavior space 𝒜 is defined as the change in rotational speed of the active ventilation floor fan, 𝒜 = { +D_TQ, 0, −D_TQ }, i.e., raising, maintaining, or lowering the duty cycle by one quantization step;
the reward R_{t+1} combines a quantitative index of the uniformity of the rack-inlet temperature distribution with the energy consumption of the active ventilation floor fan, calculated as:

R_{t+1} = − (1/|𝒩|) · Σ_{i∈𝒩} (T_{t,i} − T_t^ref)² − (A_ref · DC_t)³,  with T_t^ref = T_{t,under} + Δ_T

where R_{t+1} is the reward obtained after the system takes an action at time t; the first term quantifies the uniformity of the rack-inlet temperature distribution, its value being non-positive, with values closer to 0 indicating a more uniform distribution; T_{t,i} is the reading at time t of the first temperature sensor numbered i; T_t^ref is the rack reference temperature at time t; T_{t,under} is the reading of the second temperature sensor at time t; Δ_T is a fixed positive temperature offset set according to the degree of mixing of cold and hot air above the active ventilation floor; 𝒩 is the set of first temperature sensors and |𝒩| their total number; the second term, −(A_ref · DC_t)³, represents the fan energy consumption of the active ventilation floor, likewise non-positive, with values closer to 0 indicating lower consumption, where A_ref is a reference behavior value keeping this term on the same order of magnitude as the uniformity term, and DC_t is the PWM square-wave duty cycle at time t;
the cost function Q(s_t, a_t) is the behavior value function:

Q(s_t, a_t) = 𝔼[ Σ_{y=0}^{∞} γ^y · R_{t+y+1} | s_t, a_t ]

where the cost function Q(s, a) is referred to as the Q function; a_t ∈ 𝒜 is the action taken by the system at time t; 𝔼 is the expectation function; y is the future time relative to time t; R_{t+y+1} is the reward obtained after the system takes an action at time t + y; γ is the attenuation factor, 0 ≤ γ < 1; and γ^y, the y-th power of γ, is the attenuation factor applied to R_{t+y+1};
the Markov decision process model is summarized as: in the system state at any time t, the cumulative reward is maximized by selecting the optimal behavior, with the model formula:

max 𝔼[ Σ_{t=0}^{∞} γ^t · R_{t+1} ]  subject to  s_t ∈ 𝒮, a_t ∈ 𝒜

where γ^t is the attenuation factor applied to the system's R_{t+1} at time t;
and step 3, solving the model with the reinforcement learning model solving algorithm: using whether the rack-inlet temperature distribution is uniform and whether the active-ventilation-floor energy consumption is low as evaluation criteria, continuously exploring and learning the complex relationship between the PWM duty cycle and raising, lowering, or maintaining it, and adjusting the fan speed of the active ventilation floor so that the rack-inlet temperature distribution becomes uniform and the rack hot-spot problem is alleviated.
2. The reinforcement learning control method for the data center active ventilation floor according to claim 1, wherein in step 2 an optimal Q function is computed, i.e., in the system state at any time t the optimal behavior can be selected according to the optimal Q function so that the cumulative reward is maximized; the optimal Q function satisfies:

Q*(s_t, a_t) = 𝔼[ R_{t+1} + γ · max_{a∈𝒜} Q*(s_{t+1}, a) ]

and at any time t the optimal behavior is selected by:

a_t = argmax_{a∈𝒜} Q*(s_t, a)

where Q*(s_t, a_t) denotes the optimal Q function; s_{t+1} is the system state at time t + 1; a is any one of all the actions the system may take at time t + 1, i.e., any behavior in the action space 𝒜; and max_{a∈𝒜} Q*(s_{t+1}, a) is the largest optimal-Q value obtainable in state s_{t+1} over all actions a ∈ 𝒜.
3. The reinforcement learning control method for the data center active ventilation floor according to claim 1, wherein in step 3 the reinforcement learning model solving algorithm is an array algorithm: the Q function is stored in a two-dimensional array whose row index is the state and whose column index is the behavior; the difference δ_{t+1} between the Q sample value Q_{t+1,target} and the Q query value Q_t(s_t, a_t) is used to iteratively update the Q values in the array and compute the optimal Q function, and the optimal behavior is then selected by querying the array, maximizing the model's cumulative reward; the Q sample value is computed, following the optimal-Q formula, from the R_{t+1} and s_{t+1} obtained from the running system in real time, and the Q query value is the array entry at the row and column given by the s_t and a_t observed in real time;

the Q sample value is calculated as:

Q_{t+1,target} = R_{t+1} + γ · max_{a∈𝒜} Q_t(s_{t+1}, a)

where max_{a∈𝒜} Q_t(s_{t+1}, a) is the largest Q query value in row s_{t+1} of the two-dimensional array at time t; the array is updated as:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + β(s_t, a_t) · δ_{t+1}

where Q_t(s_t, a_t) is the query value for s_t and a_t in the array at time t; Q_{t+1}(s_t, a_t) is that entry at time t + 1; and β(s_t, a_t) ∈ [0, 1] is the learning step size for each state-behavior pair in the array.
CN202010456237.6A (priority and filing date 2020-05-26) Reinforcement learning control method for data center active ventilation floor, Active, CN111601490B (en)


Publications (2)

Publication Number | Publication Date
CN111601490A (en) | 2020-08-28
CN111601490B (en) | 2022-08-02

Family

ID=72186518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010456237.6A Active CN111601490B (en) 2020-05-26 2020-05-26 Reinforcement learning control method for data center active ventilation floor

Country Status (1)

Country Link
CN (1) CN111601490B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114020079B (en) * 2021-11-03 2022-09-16 北京邮电大学 Indoor space temperature and humidity regulation and control method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05159075A (en) * 1991-12-03 1993-06-25 Nippon Telegr & Teleph Corp <Ntt> Interpolation method based on continuous-valued Markov probability field
CN103473613A (en) * 2013-09-09 2013-12-25 武汉理工大学 Landscape structure-surface temperature-electricity consumption coupling model and application thereof
JP2015082224A (en) * 2013-10-23 2015-04-27 日本電信電話株式会社 Stochastic server load amount estimation method and server load amount estimation device
CN106528941A (en) * 2016-10-13 2017-03-22 内蒙古工业大学 Data center energy consumption optimization resource control algorithm under server average temperature constraint
CN108446783A (en) * 2018-01-29 2018-08-24 杭州电子科技大学 A kind of prediction of new fan operation power and monitoring method
WO2019154739A1 (en) * 2018-02-07 2019-08-15 Abb Schweiz Ag Method and system for controlling power consumption of a data center based on load allocation and temperature measurements
CN110322977A (en) * 2019-07-10 2019-10-11 河北工业大学 Reliability analysis method for a nuclear power reactor core water level monitoring system
CN111144793A (en) * 2020-01-03 2020-05-12 南京邮电大学 Commercial building HVAC control method based on multi-agent deep reinforcement learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8478451B2 (en) * 2009-12-14 2013-07-02 Intel Corporation Method and apparatus for dynamically allocating power in a data center
US20130226501A1 (en) * 2012-02-23 2013-08-29 Infosys Limited Systems and methods for predicting abnormal temperature of a server room using hidden markov model
US20140324240A1 (en) * 2012-12-14 2014-10-30 Alcatel-Lucent Usa Inc. Method And System For Disaggregating Thermostatically Controlled Appliance Energy Usage From Other Energy Usage
CN109983481B (en) * 2016-09-26 2023-08-15 D-波系统公司 System, method and apparatus for sampling from a sampling server
US20180100662A1 (en) * 2016-10-11 2018-04-12 Mitsubishi Electric Research Laboratories, Inc. Method for Data-Driven Learning-based Control of HVAC Systems using High-Dimensional Sensory Observations


Also Published As

Publication number Publication date
CN111601490A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN106949598B Energy-saving optimization method for network center machine rooms when network traffic load changes
US6832490B2 (en) Cooling of data centers
US20120136487A1 (en) Modulized heat-dissipation control method for datacenter
CN104728997A (en) Air conditioner for constant temperature control, constant temperature control system and constant temperature control method
CN111601490B (en) Reinforced learning control method for data center active ventilation floor
CN103615782A (en) Refrigerating unit cluster controlling method and device
CN107917509A Control method for a computer-room air conditioning system
Wan et al. Intelligent rack-level cooling management in data centers with active ventilation tiles: A deep reinforcement learning approach
CN111836524A Method for regulating the variable air volume of in-row precision air conditioners in a data center based on IT load changes
CN115793751A Thermal management method and device for battery cabinet, battery cabinet, and readable storage medium
Hamann et al. Methods and techniques for measuring and improving data center best practices
CN111637614B (en) Intelligent control method for data center active ventilation floor
CN106642583B Intelligent jet cooling system for data centers and control method thereof
CN117395942A Automatic cooling-capacity scheduling system based on an intelligent computing center
CN116390455A Modular data center machine room with cabinets in fish-scale arrangement and control method
CN110008515B (en) Renewable energy data center management method and device
CN116954329A (en) Method, device, equipment, medium and program product for regulating state of refrigeration system
KR102314866B1 (en) System for controlling temperature of computer room
CN114126369A (en) Heat dissipation control method of photovoltaic inverter
Yu et al. Hierarchical fuzzy rule-based control of renewable energy building systems
US20240194968A1 (en) Active fan balancing
Wu et al. Data center job scheduling algorithm based on temperature prediction
Lin et al. Thermal Modeling and Thermal-aware Energy Saving Methods for Cloud Data Centers: A Review
CN117537451B (en) Method and system for controlling low power consumption of intelligent thermoelectric air conditioning equipment
CN115349448B (en) Temperature control system and control method for cultivation house

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant