CN113551373A - Data center air conditioner energy-saving control method based on federal reinforcement learning - Google Patents

Data center air conditioner energy-saving control method based on federal reinforcement learning Download PDF

Info

Publication number
CN113551373A
CN113551373A (application CN202110812270.2A)
Authority
CN
China
Prior art keywords
energy
air
data center
learning
saving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110812270.2A
Other languages
Chinese (zh)
Inventor
魏清 (Wei Qing)
庄建 (Zhuang Jian)
胡凯 (Hu Kai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Zhongkun Data Technology Co ltd
Original Assignee
Jiangsu Zhongkun Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Zhongkun Data Technology Co ltd
Priority to CN202110812270.2A
Publication of CN113551373A
Legal status: Withdrawn

Classifications

    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00: Control or safety arrangements
    • F24F11/30: Control or safety arrangements for purposes related to the operation of the system, e.g. for safety or monitoring
    • F24F11/46: Improving electric energy efficiency or saving
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • H: ELECTRICITY
    • H05: ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR
    • H05K: PRINTED CIRCUITS; CASINGS OR CONSTRUCTIONAL DETAILS OF ELECTRIC APPARATUS; MANUFACTURE OF ASSEMBLAGES OF ELECTRICAL COMPONENTS
    • H05K7/00: Constructional details common to different types of electric apparatus
    • H05K7/20: Modifications to facilitate cooling, ventilating, or heating
    • H05K7/20709: Modifications to facilitate cooling, ventilating, or heating for server racks or cabinets; for data centers, e.g. 19-inch computer racks
    • H05K7/20836: Thermal management, e.g. server temperature control

Abstract

The data center air conditioner energy-saving control method based on federated reinforcement learning comprises the following steps, organized around two modules: the first module performs energy-saving optimization of a single air-conditioning system through reinforcement learning, and the second module performs energy-saving optimization control by combining a plurality of air-conditioning refrigeration systems through federated learning. 1) A single air-conditioning refrigeration system is modeled as a Markov decision process; the refrigeration-system machine room of the data center adopts an airflow organization in which the water chilling unit supplies air through the raised floor and returns air through the suspended ceiling, the operation optimization of the cold-aisle-containment air conditioning system is an important component of data center energy saving, and its basic aim is to reduce energy consumption as far as possible while meeting the requirements of the controlled area. 2) A database is established. 3) The deep deterministic policy gradient (DDPG) algorithm is applied. 4) A federated averaging learning algorithm combines a plurality of energy-saving-optimized air-conditioning refrigeration system agents to jointly train the two DDPG networks and obtain a global model with strong generalization performance.

Description

Data center air conditioner energy-saving control method based on federal reinforcement learning
Technical field:
The invention relates to the field of data center artificial intelligence, and in particular to a data center air conditioner energy-saving control method based on federated reinforcement learning.
Background art:
In order to respond to climate change and achieve sustainable development, energy conservation in data centers has become a key point in building a resource-saving society. The basic energy-saving approaches for a data center air conditioning system cover three aspects: reducing the cooling/heating load, using efficient equipment and technology, and optimizing system design and control. Among these, the control optimization of the system is closely related to the operating energy consumption of the data center. According to statistics, the life cycle of a data center usually extends over decades, the operation stage has the highest energy consumption in the whole life cycle, and the energy-saving potential of this stage is huge. The air conditioning system controller dynamically adjusts set values or operation rules in response to the constantly changing indoor load of the data center, so that the energy consumption and operating cost of the air conditioning system are reduced as far as possible while the temperature required by the machines in the controlled area is met. However, existing controllers are typically designed by building a complex physical model and carefully selecting the relevant control variables. The control strategy is also generally static, that is, experts and a large amount of prior knowledge are needed to determine a fixed control strategy; moreover, the air conditioning systems of a typical data center are managed in a unified way, so individual servers with higher or lower energy consumption cannot be taken into account, and energy redundancy or shortage can occur.
With a model-free federated reinforcement learning method, the air conditioning system can, first, be controlled in a data-driven manner without establishing a complex physical model; second, the control strategy can be adjusted adaptively by combining multiple air conditioning systems according to their external environments and their own characteristics while protecting data privacy, enabling distributed management of the air conditioners in the data center, reducing energy consumption as much as possible and improving the energy-saving effect.
In the field of deep reinforcement learning, because the overlap among user features is small and training data is limited, it is difficult to obtain a high-quality model. Although transfer learning has previously succeeded in deep reinforcement learning, directly transferring data and models between parties violates privacy. A method that protects the privacy of data and models, federated deep learning, has therefore been proposed. In federated learning, when each party's local model is updated, the shared information is processed with Gaussian differential privacy, thereby protecting privacy, and the federated learning framework is evaluated along two dimensions in testing. Because user feature overlap is small, training data is limited and user data is in fact information-sensitive, it is difficult for a central information holder to establish a high-quality model; federated deep learning instead trains a classification or clustering model from the data of multiple users. The advantage of federated deep learning is that each client shares only limited information, which is encrypted when sent and decoded when received.
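As a concrete illustration of the privacy mechanism just described, the following minimal sketch (not part of the original disclosure; the clipping bound and noise level are assumptions) adds Gaussian noise to a local parameter update before it is shared, in the spirit of Gaussian differential privacy:

```python
import numpy as np

def privatize_update(params, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip a local parameter update and add Gaussian noise before sharing it.

    params: 1-D array of local model parameters (or a parameter delta).
    clip_norm, noise_std: hypothetical privacy hyper-parameters.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(params)
    clipped = params * min(1.0, clip_norm / (norm + 1e-12))   # bound the sensitivity
    return clipped + rng.normal(0.0, noise_std, size=clipped.shape)  # Gaussian noise

# usage: shared = privatize_update(np.asarray(local_weights, dtype=np.float64))
```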
From the prior art, the inventors conclude that reinforcement learning can dynamically adjust the control strategy and optimize the controlled object without relying on a large amount of prior knowledge or a complex model, and that, given enough effective data, its optimization effect can match or even surpass control methods based on expert systems. Furthermore, a data center contains many servers whose heat outputs differ and influence one another; to account for this as comprehensively as possible, the distributed learning mode of federated learning is used to cooperatively control multiple air conditioning systems, jointly optimizing the data center air conditioning systems while protecting data privacy. No such solution has been found in the prior art.
Summary of the invention:
The invention aims to provide a data center air conditioner energy-saving control method based on federated reinforcement learning, addressing the shortcomings of existing federated learning algorithms for the energy-saving optimization of data center air conditioners built on existing artificial intelligence technologies such as federated learning and reinforcement learning. The method improves the energy-saving effect of the data center air conditioners by combining reinforcement learning with federated learning.
The technical scheme of the invention is a data center air conditioner energy-saving control method based on federated reinforcement learning. The invention first applies model-free reinforcement learning to optimize the operation of a single air conditioning system and then, considering that data center servers influence one another, uses federated learning to optimize the air conditioning system of the whole data center. Combining the characteristics of reinforcement learning and federated learning both improves the energy-saving effect of a single air conditioner and optimizes the whole data center air conditioning system while protecting data privacy. The innovation is organized as two modules: the first module performs reinforcement learning on a single air conditioning system, and the second module combines multiple air conditioning systems through federated learning for energy-saving optimization control.
module 1: and carrying out energy-saving optimization on the single air conditioning system through reinforcement learning. The energy-saving optimization method comprises the following steps:
step 1: the refrigeration system is modeled with a markov decision process. A refrigerating system machine room of the data center adopts an airflow organization mode that a water chilling unit supplies air from a floor and returns air through a suspended ceiling, the operation optimization of a cold channel closed air conditioning system is an important component for energy saving of the data center, and the basic goal of the operation optimization is to reduce energy consumption as far as possible on the premise of meeting the requirements of a controlled area.
As shown in fig. 1, the following relationships exist between the components of the refrigeration system: between cooling water set and data center computer lab: chilled water is generated by the water chilling unit, is conveyed to the server air-water heat exchanger in the data center machine room through the chilled water circulating pump, is heated after passing through the server air-water heat exchanger, and carries heat back to the water chilling unit. Between the water chilling unit and the cooling tower: high-temperature cooling water is sent to a cooling tower through a cooling water circulating pump, forms cooling water with a lower temperature after wind-water heat exchange of the cooling tower and returns to a water chilling unit to complete circulation; the strategy optimization problem of the data center cold water system is modeled as a Markov decision process, and can be described as a quadruple { S, A, P, R }. Wherein S is a state space represented by a cooling area and external environment parameters; a is an action space composed of available control commands of a refrigeration system controller; p is the transition probability between different environmental states of the cold source system; r is the instantaneous return available for taking different control actions a in state S. A markov decision model for a refrigeration system is shown in figure 2.
Step 1.1 determining Markov quadruple variables
State space S: the indoor temperature, indoor relative humidity, outdoor temperature and outdoor relative humidity are selected as state space parameters. In order to train a sound decision policy, the invention considers both past and present influencing factors: the current time t is taken as the present state, and the two previous times t-1 and t-2 are taken as the past states. Action space A: the chilled water supply temperature, the chilled water supply/return pressure difference and the cooling water return temperature are selected as control actions. The transition probability P depends on the true state of the system environment after a control action is executed, and the agent makes an unbiased estimate of it through repeated Monte Carlo sampling.
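The following sketch illustrates how the state and action vectors described above could be assembled; the variable names and the 12-dimensional layout are assumptions of this illustration, not specifications from the original text:

```python
import numpy as np
from collections import deque

# One observation: [T_indoor, H_indoor, T_outdoor, H_outdoor].
# The state stacks the observations at times t-2, t-1 and t (12 values in total).
history = deque(maxlen=3)

def build_state(observation):
    """Append the newest observation and return the stacked state vector."""
    history.append(np.asarray(observation, dtype=np.float32))
    while len(history) < 3:                  # pad until three time steps exist
        history.appendleft(history[0])
    return np.concatenate(list(history))     # shape: (12,)

# Action vector (3 values): chilled water supply temperature,
# chilled water supply/return pressure difference, cooling water return temperature.
```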
The immediate reward R is shown in equation (1):

r_t = −Power(s_t, a_t) − β · Temp    (1)

where Power(s_t, a_t) is the energy consumption of the cold source system at time t (entering the reward with a negative sign so that lower energy consumption is rewarded), Temp is the penalty for the machine room temperature crossing the safety boundary, and β is the penalty coefficient. The inventors' investigation found that the failure rate of the data center is lowest when the temperature is controlled at 15-20 ℃, so this range is taken as the temperature requirement of the controlled area: Y̲ = 15 denotes the lower safe temperature limit of the controlled area and Ȳ = 20 denotes the upper safe temperature limit.

Temp = log(1 + exp(Temperature)²)    (2)

where Temperature denotes the temperature in the cabinet corresponding to the single air conditioner.
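For illustration, the reward of equations (1) and (2) could be computed as in the sketch below; treating the penalty as a function of how far the cabinet temperature leaves the 15-20 ℃ band is an interpretation by this description, since the original formula image is not reproduced in the text:

```python
import math

def immediate_reward(power, cabinet_temperature, beta=0.5,
                     t_low=15.0, t_high=20.0):
    """Immediate reward r_t of equations (1)-(2), as interpreted here.

    power: energy consumption Power(s_t, a_t) of the cold source system.
    cabinet_temperature: temperature in the cabinet served by the air conditioner.
    beta: penalty coefficient (the value is a placeholder).
    """
    # Equation (2): the penalty grows smoothly once the temperature leaves
    # the 15-20 degC safe band (the use of the excess over the band is an
    # assumption about "crossing the safety boundary").
    excess = max(t_low - cabinet_temperature, cabinet_temperature - t_high, 0.0)
    temp_penalty = math.log(1.0 + math.exp(excess) ** 2)
    # Equation (1): negative energy consumption minus the weighted penalty.
    return -power - beta * temp_penalty

# usage: r = immediate_reward(power=35.2, cabinet_temperature=22.5)
```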
Step 2: establish a database. The database receives the data collected by the various sensors and converts them, through the quadruple of step 1, into records of the form (s_t, a_t, r_t, s_{t+1}); after conversion the records are stored in the corresponding files of the database. The upper limit of the database is set to Top, and when the number of files reaches this upper limit the earliest stored data are discarded.
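A minimal sketch of the capped database of step 2, implemented as a simple replay buffer (class and parameter names are illustrative assumptions):

```python
import random
from collections import deque

class ExperienceDatabase:
    """Stores (s_t, a_t, r_t, s_{t+1}) records; the oldest is dropped when full."""

    def __init__(self, top=100_000):          # 'Top' cap from step 2 (value assumed)
        self.buffer = deque(maxlen=top)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, n):
        """Randomly draw n records for the update in step 3.4."""
        return random.sample(list(self.buffer), min(n, len(self.buffer)))
```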
Step 3: deep deterministic policy gradient (DDPG) algorithm. In DDPG reinforcement learning, the agent's policy π is represented directly in parameterized form and the mapping from environment states to actions is learned; its state value function can be expressed as:

V_π(s; ω) = E[Σ_{t=0} r(s_t, a_t); π_ω] = Σ_τ P(τ; ω) R(τ)    (3)

where ω denotes the learnable parameters, π_ω denotes the parameterized policy, r(s_t, a_t) denotes the immediate reward at the current time, τ denotes a state-action trajectory sequence, P(τ; ω) denotes the probability that the trajectory sequence τ occurs, and R(τ) denotes the cumulative return of the trajectory sequence τ.
Step 3.1: differentiate equation (3), as shown in equation (4):

∇_ω V_π(s; ω) = Σ_τ ∇_ω P(τ; ω) R(τ) = Σ_τ P(τ; ω) ∇_ω log P(τ; ω) R(τ)    (4)

After sampling n trajectories with the policy π_ω, the average experience of the n trajectories can be used to approximate the policy gradient, as shown below:

∇_ω V_π(s; ω) ≈ (1/n) Σ_{i=1}^{n} ∇_ω log P(τ_i; ω) R_i(τ)    (5)

where ∇_ω log P(τ_i; ω) is the gradient direction of the i-th trajectory and R_i(τ) represents the step size of the parameter update.
Step 3.2: define the deterministic policy gradient. In the Q-learning algorithm, the state-action value function Q represents the expected cumulative return of taking an action in a given state, as shown below:

Q_π(s, a) = E_π[Σ_{t=0} γ^t r_t | s_0 = s, a_0 = a]    (6)

where E_π denotes the expectation of the cumulative return, γ ∈ (0, 1) is the discount factor describing how the immediate reward decays with system state transitions, r_t is the immediate reward, a_0 denotes the initial action, s_0 denotes the initial state, and t denotes the system time. Replacing the probability P in equation (5) with the deterministic policy μ_ω(a|s) and replacing the trajectory return with the Q-value function yields the deterministic policy gradient formula:

∇_ω V(μ_ω) = E[∇_ω μ_ω(a|s) · ∇_a Q(s, a)|_{a=μ_ω(s)}]    (7)
and 3.3, building the DDPG network. The DDPG algorithm is realized by adopting a framework of action (Actor) and comment (Critic), wherein the Actor is responsible for giving actions in different environment states, the Critic represents the evaluation of the quality degree of corresponding actions, and the relationship between the action and the Critic is shown in FIG. 3. The Actor network and the Critic network in the invention are both built by a 3-layer neural network. Wherein the Actor network input layer comprises 200 neurons, the middle hidden layer comprises 300 neurons, and the output layer comprises 3 neurons. The Critic network has 400 neurons in the input layer, 300 neurons in the middle hidden layer and 1 neuron in the output layer. As shown in fig. 4
Step 3.4: update the parameters.
The parameter update formula is as follows:
θ_{t+1} = θ_t + α_θ [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]    (8)
ω_{t+1} = ω_t + α_ω ∇_a Q(s_t, a_t)|_{a=μ_ω(s_t)} · ∇_ω μ_ω(s_t)    (9)
where θ is the weight parameter of the Critic network, ω is the weight parameter of the Actor network, α_θ is the learning rate of the Critic network, which the invention sets to 0.005, and α_ω is the learning rate of the Actor network, which the invention sets to 0.0005.
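A hedged sketch of the parameter update of step 3.4, written as the usual temporal-difference critic update and deterministic-policy-gradient actor update; the optimizer, the discount factor and the batch handling are assumptions, and Actor/Critic refer to the illustrative classes sketched above:

```python
import torch

# One update of equations (8)-(9) on a sampled minibatch (sketch only: the
# target networks and exploration noise of full DDPG are omitted).
actor, critic = Actor(), Critic()
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=0.0005)   # alpha_omega
critic_opt = torch.optim.Adam(critic.parameters(), lr=0.005)    # alpha_theta

def ddpg_update(states, actions, rewards, next_states, gamma=0.99):
    """states/actions/next_states: float tensors; rewards: shape (batch, 1)."""
    with torch.no_grad():
        target = rewards + gamma * critic(next_states, actor(next_states))
    critic_loss = ((critic(states, actions) - target) ** 2).mean()   # TD error, eq. (8)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(states, actor(states)).mean()   # deterministic policy gradient, eq. (9)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```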
Step 4: federated averaging learning algorithm. The two DDPG networks are trained jointly across multiple agents to obtain a global model with strong generalization performance. In each communication round, every client downloads the initial model parameters of the global model, trains its local model on its own local data, and then sends the updated parameters, i.e., θ_{t+1} and ω_{t+1} of equations (8) and (9), back to the terminal server, where the local model parameters are aggregated by averaging to continuously update the global model, as shown in the following equations. Only model parameters are transmitted during the communication between the local models and the global model; the basic framework of the federated averaging learning algorithm is shown in FIG. 6.

ω_global = (1/K) Σ_{k=1}^{K} ω_k    (10)

θ_global = (1/K) Σ_{k=1}^{K} θ_k    (11)

where ω_k is the weight parameter of the Actor network of the k-th agent and θ_k is the weight parameter of the Critic network of the k-th agent.
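A minimal federated-averaging sketch for equations (10) and (11), assuming the K agents expose their Actor and Critic parameters as PyTorch state dicts (the function and attribute names are illustrative):

```python
import torch

def federated_average(state_dicts):
    """Element-wise mean of the K local parameter sets, as in equations (10)-(11)."""
    return {key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
            for key in state_dicts[0]}

def communication_round(global_actor, global_critic, agents):
    """One round: distribute the global model, train locally, average the results."""
    actor_updates, critic_updates = [], []
    for agent in agents:                                     # one agent per air conditioner
        agent.actor.load_state_dict(global_actor.state_dict())
        agent.critic.load_state_dict(global_critic.state_dict())
        agent.train_locally()                                # DDPG updates of step 3.4
        actor_updates.append(agent.actor.state_dict())
        critic_updates.append(agent.critic.state_dict())
    # Average aggregation on the terminal server (only parameters are exchanged).
    global_actor.load_state_dict(federated_average(actor_updates))
    global_critic.load_state_dict(federated_average(critic_updates))
```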
Advantageous effects: based on federated deep reinforcement learning, the control strategy can be adjusted dynamically; the energy-saving optimization problem of the data center air conditioners is solved with the federated learning algorithm, the controlled object is optimized without relying on a large amount of prior knowledge or a complex model, and with enough effective data the reinforcement learning optimization effect can match or even surpass control methods based on expert systems. The data center contains many servers whose heat outputs differ and influence one another; to account for this as comprehensively as possible, the distributed learning mode of federated learning is used to cooperatively control multiple air conditioning systems, jointly optimizing the data center air conditioning systems while protecting data privacy.
Drawings
FIG. 1 is a schematic diagram of a data center refrigeration system according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a Markov decision process model for a cold source according to an embodiment of the present invention;
FIG. 3 is a diagram of an action (Actor) and comment (Critic) framework of an embodiment of the present invention;
fig. 4 is a schematic diagram of an Actor network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Critic network structure according to an embodiment of the present invention;
FIG. 6 is a block diagram of a federated average learning algorithm framework according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Aiming at the shortcomings of existing federated learning algorithms for the energy-saving optimization of data center air conditioners built on existing artificial intelligence technologies such as federated learning and reinforcement learning, the invention provides a data center air conditioner energy-saving control method based on federated reinforcement learning. The method improves the energy-saving effect of the data center air conditioners by combining reinforcement learning and federated learning. The invention first applies model-free reinforcement learning to optimize the operation of a single air conditioning system and then, considering that data center servers influence one another, uses federated learning to optimize the air conditioning system of the whole data center. Combining the characteristics of reinforcement learning and federated learning both improves the energy-saving effect of a single air conditioner and optimizes the whole data center air conditioning system while protecting data privacy. The innovation is mainly divided into two modules: the first module performs reinforcement learning on a single air conditioning system; the second module combines multiple air conditioning systems through federated learning for energy-saving optimization control.
Module 1: and carrying out energy-saving optimization on the single air conditioning system through reinforcement learning.
Step 1: modeling a Markov decision process for a refrigeration system. The machine room of the data center adopts an airflow organization form of floor air supply and ceiling air return, the operation optimization of the cold channel closed air conditioning system is an important component part for energy saving of the data center, and the basic goal of the operation optimization is to reduce energy consumption as far as possible on the premise of meeting the requirements of a controlled area.
As shown in fig. 1, the agent K represents a terminal precision air conditioner, and the following relationship exists between the refrigeration system configurations. Between cooling water set and data center computer lab: the water chilling unit generates chilled water, the chilled water is conveyed to a data center machine room through a chilled water circulating pump, and the chilled water is heated after exchanging heat with wind water of the server and carries heat back to the water chilling unit. Between the water chilling unit and the cooling tower: high-temperature cooling water is sent to the cooling tower through a cooling water circulating pump, forms cooling water with a lower temperature after heat exchange of wind and water, and returns to the water chilling unit to complete circulation. The method models the strategy optimization problem of the data center cold water system into a Markov decision process, and can be described as a quadruple { S, A, P, R }. Wherein S is a state space represented by a cooling area and external environment parameters; a is an action space composed of available control commands of a refrigeration system controller; p is the transition probability between different environmental states of the cold source system; r is the instantaneous return available for taking different control actions a in state S. A markov decision model for a refrigeration system is shown in figure 2.
Step 1.1 determining Markov quadruple variables
State space S: the four parameters indoor temperature T_indoor, indoor relative humidity H_indoor, outdoor temperature T_outdoor and outdoor relative humidity H_outdoor are selected as state space parameters. In order to train a sound control decision, the invention considers both past and present influencing factors: the current time t is taken as the present state, and the two previous times t-1 and t-2 are taken as the past states. Action space A: the chilled water supply temperature, the chilled water supply/return pressure difference and the cooling water return temperature are selected as control actions. The transition probability P depends on the real state of the system environment after a control action is executed; the air-conditioning refrigeration system makes an unbiased estimate of it through repeated Monte Carlo sampling, and it is an algorithm parameter to be learned. The immediate reward R is responsible for evaluating the action. The operation of the air conditioning system is evaluated mainly by two indices: first, the energy consumption required by the whole system in operation, and second, guaranteeing that the system safely reaches the required temperature. The reward is shown in equation (1):

r_t = −Power(s_t, a_t) − β · Temp    (1)

where Power(s_t, a_t) is the energy consumption of the cold source system at time t, Temp is the penalty for the machine room temperature crossing the safety boundary, and β is the penalty coefficient. The inventors' investigation found that the failure rate of the data center is lowest when the temperature is controlled at 15-20 ℃, so this range is taken as the temperature requirement of the controlled area: Y̲ = 15 denotes the lower safe temperature limit of the controlled area and Ȳ = 20 denotes the upper safe temperature limit. Power(s_t, a_t), the energy consumption of the air conditioning system at time t, enters the reward with a negative sign in order to reduce the energy consumption as much as possible.

Temp = log(1 + exp(Temperature)²)    (2)

where Temperature denotes the temperature of the controlled area, i.e., the temperature of the machine room.
Step 2: establish a database. The database receives the data collected by the various sensors and converts them, through the quadruple of step 1, into records of the form (s_t, a_t, r_t, s_{t+1}); after conversion the records are stored in the corresponding files of the database. The upper limit of the database is set to Top, and when the number of files reaches this upper limit the earliest stored data are discarded.
Step 3: in the deep deterministic policy gradient (DDPG) algorithm, the air conditioning system represents the control strategy π directly in parameterized form and learns the mapping from environment states to actions; its state value function can be expressed as:

V_π(s; ω) = E[Σ_{t=0} r(s_t, a_t); π_ω] = Σ_τ P(τ; ω) R(τ)    (3)

where ω denotes the learnable parameters, π_ω denotes the parameterized policy, r(s_t, a_t) denotes the immediate reward at the current time, τ denotes a state-action trajectory sequence, P(τ; ω) denotes the probability that the trajectory sequence τ occurs, and R(τ) denotes the cumulative return of the trajectory sequence τ.
Step 3.1: differentiate equation (3), as shown in equation (4):

∇_ω V_π(s; ω) = Σ_τ ∇_ω P(τ; ω) R(τ) = Σ_τ P(τ; ω) ∇_ω log P(τ; ω) R(τ)    (4)

After sampling n trajectories with the policy π_ω, the average experience of the n trajectories can be used to approximate the policy gradient, as shown below:

∇_ω V_π(s; ω) ≈ (1/n) Σ_{i=1}^{n} ∇_ω log P(τ_i; ω) R_i(τ)    (5)

where ∇_ω log P(τ_i; ω) is the gradient direction of the i-th trajectory and R_i(τ) represents the step size of the parameter update.
Step 3.2: define the deterministic policy gradient. In the Q-learning algorithm, the state-action value function Q represents the expected cumulative return of taking an action in a given state, as shown below:

Q_π(s, a) = E_π[Σ_{t=0} γ^t r_t | s_0 = s, a_0 = a]    (6)

where E_π denotes the expectation of the cumulative return, γ ∈ (0, 1) is the discount factor representing the degree of decay of the immediate reward with system state transitions, r_t is the immediate reward, a_0 denotes the initial action, s_0 denotes the initial state, and t denotes the system time. Replacing the probability P in equation (5) with the deterministic policy μ_ω(a|s) and replacing the trajectory return with the Q-value function yields the deterministic policy gradient formula:

∇_ω V(μ_ω) = E[∇_ω μ_ω(a|s) · ∇_a Q(s, a)|_{a=μ_ω(s)}]    (7)
Step 3.3: build the DDPG networks. The DDPG algorithm is implemented with an Actor-Critic framework: the Actor is responsible for giving the action in different environment states, and the Critic evaluates how good the corresponding action is; their relationship is shown in FIG. 3. Both the Actor network and the Critic network in the invention are built as 3-layer neural networks. The Actor network input layer contains 200 neurons, the middle hidden layer 300 neurons and the output layer 3 neurons; the Critic network has 400 neurons in the input layer, 300 neurons in the middle hidden layer and 1 neuron in the output layer, as shown in FIG. 4 and FIG. 5.
Step 3.4: update the parameters. After a period of data collection, N records are randomly drawn from the database and each neural network parameter is updated by gradient descent, so as to improve the control strategy of the air conditioning system.
The parameter update formula is as follows:
θ_{t+1} = θ_t + α_θ [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]    (8)
ω_{t+1} = ω_t + α_ω ∇_a Q(s_t, a_t)|_{a=μ_ω(s_t)} · ∇_ω μ_ω(s_t)    (9)
where θ is the weight parameter of the Critic network, ω is the weight parameter of the Actor network, α_θ is the learning rate of the Critic network, which the invention sets to 0.005, and α_ω is the learning rate of the Actor network, which the invention sets to 0.0005.
Step 4: federated averaging learning algorithm. Multiple air conditioning systems are combined for joint training to obtain two global DDPG network models with strong generalization performance, namely an Actor network and a Critic network. In each communication round, every client downloads the initial model parameters from the global model, trains its local model on its own local data, and then sends the updated parameters, i.e., θ_{t+1} and ω_{t+1} of equations (8) and (9), back to the terminal server, where the local model parameters are aggregated by averaging to continuously update the global model, as shown in the following equations. Only model parameters are transmitted during the communication between the local models and the global model; the basic framework of the federated averaging learning algorithm is shown in FIG. 6.
ω_global = (1/K) Σ_{k=1}^{K} ω_k    (10)

θ_global = (1/K) Σ_{k=1}^{K} θ_k    (11)

where ω_k is the weight parameter of the Actor network of the k-th agent and θ_k is the weight parameter of the Critic network of the k-th agent.
The overall flow chart of the invention is shown in fig. 7.
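Putting the two modules together, the overall flow of FIG. 7 can be summarized by the following sketch; the agent class, the number of agents and the number of communication rounds are illustrative assumptions, and Actor, Critic, ExperienceDatabase and communication_round refer to the sketches given earlier:

```python
class AirConditionerAgent:
    """One terminal air conditioner: its own DDPG networks and experience database."""
    def __init__(self):
        self.actor, self.critic = Actor(), Critic()
        self.database = ExperienceDatabase()

    def train_locally(self, steps=1000):
        # Collect (s_t, a_t, r_t, s_{t+1}) records from the sensors into
        # self.database (step 2) and run `steps` updates of equations (8)-(9)
        # (step 3.4); omitted here for brevity.
        pass

agents = [AirConditionerAgent() for _ in range(8)]     # K = 8 agents is an assumed count
global_actor, global_critic = Actor(), Critic()

for round_index in range(50):                          # number of rounds is assumed
    communication_round(global_actor, global_critic, agents)   # module 2, eqs. (10)-(11)
```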

Claims (2)

1. A data center air conditioner energy-saving control method based on federated reinforcement learning, characterized by comprising the following steps:
based on two modules: the first module performs reinforcement learning on a single air-conditioning system for energy-saving optimization, and the second module performs energy-saving optimization control by combining a plurality of air-conditioning refrigeration systems through federated learning;
module 1: a module for performing energy-saving optimization on a single air conditioning system through reinforcement learning; the energy-saving optimization comprises the following steps:
step 1: modeling a single air-conditioning refrigeration system as a Markov decision process; the refrigeration-system machine room of the data center adopts an airflow organization in which the water chilling unit supplies air through the raised floor and returns air through the suspended ceiling, the operation optimization of the cold-aisle-containment air conditioning system is an important component of data center energy saving, and the basic aim of the operation optimization is to reduce energy consumption as far as possible on the premise of meeting the requirements of the controlled area;
the following relationships exist between the components of the refrigeration system: between the water chilling unit and the data center machine room: chilled water is generated by the water chilling unit, is conveyed by the chilled water circulating pump to the server air-water heat exchangers in the data center machine room, is heated after passing through the server air-water heat exchangers, and carries the heat back to the water chilling unit; between the water chilling unit and the cooling tower: high-temperature cooling water is sent to the cooling tower by the cooling water circulating pump, becomes lower-temperature cooling water after the air-water heat exchange of the cooling tower and returns to the water chilling unit to complete the cycle; the strategy optimization problem of the data center chilled water system is modeled as a Markov decision process described as a quadruple {S, A, P, R}: wherein S is the state space represented by the cooling area and the external environment parameters; A is the action space composed of the control commands available to the refrigeration system controller; P is the transition probability between different environmental states of the cold source system; R is the immediate reward that can be obtained by taking different control actions A in the state S;
step 1.1 determining Markov quadruple variables
Selecting indoor temperature, indoor relative humidity, outdoor temperature and outdoor relative humidity as state space parameters; taking the current time t as the current state, and taking the previous two times t-1 and t-2 as the past states; the action space A selects chilled water supply temperature, chilled water supply and return water pressure difference and cooling water return water temperature as control actions; the transition probability P depends on the real state of the system environment after the control action is executed, and unbiased estimation needs to be made by means of multiple Monte Carlo sampling alignment; the immediate reward R is shown in formula (1):
r_t = −Power(s_t, a_t) − β · Temp    (1)

wherein Power(s_t, a_t) is the energy consumption of the cold source system at time t, Temp is the penalty for the machine room temperature crossing the safety boundary, and β is the penalty coefficient; the temperature is controlled at 15-20 ℃ as the temperature requirement of the controlled area, Y̲ = 15 representing the lower safe temperature limit of the controlled area and Ȳ = 20 representing the upper safe temperature limit of the controlled area;

Temp = log(1 + exp(Temperature)²)    (2)

wherein Temperature represents the temperature in the cabinet corresponding to the single air conditioner;
step 2, establishing a database, wherein the database receives data acquired by various sensors and converts the data, through the quadruple of step 1, into records of the form (s_t, a_t, r_t, s_{t+1}); the records are stored in the corresponding files of the database after conversion, the upper limit of the number of files of the database is set to Top, and the earliest stored data are discarded when the number of files reaches the upper limit;
step 3, deep deterministic policy gradient (DDPG) algorithm:
in reinforcement learning, the strategy π of the agent is represented directly in parameterized form and the mapping between environment and action is learned; its state value function is expressed as:

V_π(s; ω) = E[Σ_{t=0} r(s_t, a_t); π_ω] = Σ_τ P(τ; ω) R(τ)    (3)

wherein ω denotes the learnable parameters, π_ω denotes the parameterized policy, r(s_t, a_t) denotes the immediate reward at the current time, τ denotes a state-action trajectory sequence, P(τ; ω) denotes the probability that the trajectory sequence τ occurs, and R(τ) denotes the cumulative return of the trajectory sequence τ;
step 3.1, differentiating equation (3), as shown in equation (4):

∇_ω V_π(s; ω) = Σ_τ ∇_ω P(τ; ω) R(τ) = Σ_τ P(τ; ω) ∇_ω log P(τ; ω) R(τ)    (4)

when the parameterized strategy π_ω is used to sample n trajectories, the average experience of the n trajectories can be used to approximate the policy gradient, as shown below;

∇_ω V_π(s; ω) ≈ (1/n) Σ_{i=1}^{n} ∇_ω log P(τ_i; ω) R_i(τ)    (5)

wherein ∇_ω log P(τ_i; ω) is the gradient direction of the i-th trajectory and R_i(τ) represents the step size of the parameter update;
step 3.2, defining a deterministic policy gradient; in the Q-learning algorithm, the state-action value function Q represents the expected value of the accumulated return at the state-action pair, as shown below;

Q_π(s, a) = E_π[Σ_{t=0} γ^t r_t | s_0 = s, a_0 = a]    (6)

wherein E_π represents the cumulative return expectation, γ ∈ (0, 1) is the discount factor representing the degree of decay of the immediate reward with system state transitions, r_t is the immediate reward, a_0 represents the initial action, s_0 represents the initial state, and t represents the system time; replacing the probability P in formula (5) with the deterministic strategy μ_ω(a|s) and replacing the trajectory return with the Q-value function yields the deterministic policy gradient calculation formula:

∇_ω V(μ_ω) = E[∇_ω μ_ω(a|s) · ∇_a Q(s, a)|_{a=μ_ω(s)}]    (7)

step 3.3, building the DDPG networks; the DDPG algorithm is implemented with an Actor-Critic framework, wherein the Actor is responsible for giving the action in different environment states and the Critic expresses how good the corresponding action is; the Actor network and the Critic network are both neural networks, the Actor network output layer comprises 3 neurons, and the Critic network output layer comprises 1 neuron;
step 3.4, updating the parameters;
the parameter update formulas are as follows:

θ_{t+1} = θ_t + α_θ [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]    (8)

ω_{t+1} = ω_t + α_ω ∇_a Q(s_t, a_t)|_{a=μ_ω(s_t)} · ∇_ω μ_ω(s_t)    (9)

wherein θ is the weight parameter of the Critic network, ω is the weight parameter of the Actor network, α_θ is the learning rate of the Critic network, set to 0.005, and α_ω is the learning rate of the Actor network, set to 0.0005;
the energy-saving optimization control of the plurality of air-conditioning refrigeration systems adopts the second module, in which the plurality of air-conditioning refrigeration systems are combined through federated learning for energy-saving optimization control:
step 4, federated averaging learning algorithm; the plurality of energy-saving-optimized air-conditioning refrigeration system agents are combined to jointly train the two DDPG networks to obtain a global model with strong generalization performance; the client downloads the initial model parameters from the global model in each communication round, trains the local model on its respective local data, and then sends the updated parameters, namely the parameters θ_{t+1} and ω_{t+1} of formula (8) and formula (9), back to the terminal server, and the local model parameters are aggregated by averaging on the terminal server to continuously update the global model, as shown in the following equations; only model parameters are transmitted during the communication between the local model and the global model;

ω_global = (1/K) Σ_{k=1}^{K} ω_k    (10)

θ_global = (1/K) Σ_{k=1}^{K} θ_k    (11)

wherein ω_k is the weight parameter of the Actor network of the k-th air conditioning system and θ_k is the weight parameter of the Critic network of the k-th agent.
2. The data center air conditioner energy-saving control method based on federated reinforcement learning according to claim 1, wherein the Actor network and the Critic network are both constructed as 3-layer neural networks; the Actor network input layer comprises 200 neurons, the middle hidden layer comprises 300 neurons, and the output layer comprises 3 neurons; the Critic network has 400 neurons in the input layer, 300 neurons in the middle hidden layer and 1 neuron in the output layer.
CN202110812270.2A 2021-07-19 2021-07-19 Data center air conditioner energy-saving control method based on federal reinforcement learning Withdrawn CN113551373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110812270.2A CN113551373A (en) 2021-07-19 2021-07-19 Data center air conditioner energy-saving control method based on federal reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110812270.2A CN113551373A (en) 2021-07-19 2021-07-19 Data center air conditioner energy-saving control method based on federal reinforcement learning

Publications (1)

Publication Number Publication Date
CN113551373A true CN113551373A (en) 2021-10-26

Family

ID=78103352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110812270.2A Withdrawn CN113551373A (en) 2021-07-19 2021-07-19 Data center air conditioner energy-saving control method based on federal reinforcement learning

Country Status (1)

Country Link
CN (1) CN113551373A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114017904A (en) * 2021-11-04 2022-02-08 广东电网有限责任公司 Operation control method and device for building HVAC system
CN114330852A (en) * 2021-12-21 2022-04-12 清华大学 Energy-saving optimization method and device for tail end air conditioning system of integrated data center cabinet
CN114901057A (en) * 2022-07-12 2022-08-12 联通(广东)产业互联网有限公司 Multi-point energy consumption detection and dynamic regulation system in data center machine room
WO2023093388A1 (en) * 2021-11-26 2023-06-01 深圳市愚公科技有限公司 Air purifier adjusting method based on reinforcement learning model, and air purifier
CN116294089A (en) * 2023-05-23 2023-06-23 浙江之科云创数字科技有限公司 Air conditioning system control method and device, storage medium and electronic equipment

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114017904A (en) * 2021-11-04 2022-02-08 广东电网有限责任公司 Operation control method and device for building HVAC system
CN114017904B (en) * 2021-11-04 2023-01-20 广东电网有限责任公司 Operation control method and device for building HVAC system
WO2023093388A1 (en) * 2021-11-26 2023-06-01 深圳市愚公科技有限公司 Air purifier adjusting method based on reinforcement learning model, and air purifier
CN114330852A (en) * 2021-12-21 2022-04-12 清华大学 Energy-saving optimization method and device for tail end air conditioning system of integrated data center cabinet
CN114330852B (en) * 2021-12-21 2022-09-23 清华大学 Energy-saving optimization method and device for tail end air conditioning system of integrated data center cabinet
WO2023116742A1 (en) * 2021-12-21 2023-06-29 清华大学 Energy-saving optimization method and apparatus for terminal air conditioning system of integrated data center cabinet
CN114901057A (en) * 2022-07-12 2022-08-12 联通(广东)产业互联网有限公司 Multi-point energy consumption detection and dynamic regulation system in data center machine room
CN114901057B (en) * 2022-07-12 2022-09-27 联通(广东)产业互联网有限公司 Multi-point energy consumption detection and dynamic regulation system in data center machine room
CN116294089A (en) * 2023-05-23 2023-06-23 浙江之科云创数字科技有限公司 Air conditioning system control method and device, storage medium and electronic equipment
CN116294089B (en) * 2023-05-23 2023-08-18 浙江之科云创数字科技有限公司 Air conditioning system control method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN113551373A (en) Data center air conditioner energy-saving control method based on federal reinforcement learning
CN102301288B (en) Systems and methods to control energy consumption efficiency
WO2023093820A1 (en) Device control optimization method, display platform, cloud server, and storage medium
CN113283156B (en) Energy-saving control method for subway station air conditioning system based on deep reinforcement learning
CN112415924A (en) Energy-saving optimization method and system for air conditioning system
CN114383299B (en) Central air-conditioning system operation strategy optimization method based on big data and dynamic simulation
CN112598150B (en) Method for improving fire detection effect based on federal learning in intelligent power plant
CN113039506B (en) Causal learning-based data center foundation structure optimization method
CN112628956B (en) Water chilling unit load prediction control method and system based on edge cloud cooperative framework
CN115220351B (en) Intelligent energy-saving optimization control method for building air conditioning system based on cloud side end
Lissa et al. Transfer learning applied to reinforcement learning-based hvac control
Yu et al. District cooling system control for providing operating reserve based on safe deep reinforcement learning
CN111649457A (en) Dynamic predictive machine learning type air conditioner energy-saving control method
Jiang et al. Deep transfer learning for thermal dynamics modeling in smart buildings
Wang et al. Toward physics-guided safe deep reinforcement learning for green data center cooling control
CN114326987B (en) Refrigerating system control and model training method, device, equipment and storage medium
Feng et al. A fully distributed voting strategy for AHU fault detection and diagnosis based on a decentralized structure
CN114970358A (en) Data center energy efficiency optimization method and system based on reinforcement learning
Sun et al. Energy consumption optimization of building air conditioning system via combining the parallel temporal convolutional neural network and adaptive opposition-learning chimp algorithm
Deng et al. Toward smart multizone HVAC control by combining context-aware system and deep reinforcement learning
CN116907036A (en) Deep reinforcement learning water chilling unit control method based on cold load prediction
Groumpos et al. New advanced technology methods for energy efficiency of buildings
Burger et al. ARX model of a residential heating system with backpropagation parameter estimation algorithm
Lin et al. Optimizing for Large Time Delay Systems by BP Neural Network and Evolutionary Algorithm Improving.
CN110595008A (en) Multi-equipment collaborative optimization method and system for ground source heat pump air conditioning system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211026