CN113551373A - Data center air conditioner energy-saving control method based on federal reinforcement learning - Google Patents

Data center air conditioner energy-saving control method based on federal reinforcement learning Download PDF

Info

Publication number
CN113551373A
CN113551373A (application CN202110812270.2A)
Authority
CN
China
Prior art keywords
energy
air
data center
learning
saving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110812270.2A
Other languages
Chinese (zh)
Inventor
魏清 (Wei Qing)
庄建 (Zhuang Jian)
胡凯 (Hu Kai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Zhongkun Data Technology Co ltd
Original Assignee
Jiangsu Zhongkun Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Zhongkun Data Technology Co ltd
Priority to CN202110812270.2A
Publication of CN113551373A
Legal status: Withdrawn

Classifications

    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00: Control or safety arrangements
    • F24F11/30: Control or safety arrangements for purposes related to the operation of the system, e.g. for safety or monitoring
    • F24F11/46: Improving electric energy efficiency or saving
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • H: ELECTRICITY
    • H05: ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR
    • H05K: PRINTED CIRCUITS; CASINGS OR CONSTRUCTIONAL DETAILS OF ELECTRIC APPARATUS; MANUFACTURE OF ASSEMBLAGES OF ELECTRICAL COMPONENTS
    • H05K7/00: Constructional details common to different types of electric apparatus
    • H05K7/20: Modifications to facilitate cooling, ventilating, or heating
    • H05K7/20709: Modifications to facilitate cooling, ventilating, or heating for server racks or cabinets; for data centers, e.g. 19-inch computer racks
    • H05K7/20836: Thermal management, e.g. server temperature control

Abstract

The data center air conditioner energy-saving control method based on federated reinforcement learning comprises the following steps, organized around two modules: the first module performs energy-saving optimization of a single air-conditioning system through reinforcement learning, and the second module performs energy-saving optimization control by combining a plurality of air-conditioning refrigeration systems through federated learning. 1) A single air-conditioning refrigeration system is modeled as a Markov decision process; the refrigeration-system machine room of the data center adopts an airflow organization in which the water chilling unit supplies air through the raised floor and returns air through the suspended ceiling, the operation optimization of the cold-aisle-containment air conditioning system is an important component of data center energy saving, and its basic aim is to reduce energy consumption as far as possible while meeting the requirements of the controlled area. 2) A database is established. 3) The deep deterministic policy gradient (DDPG) algorithm is applied. 4) A federated averaging learning algorithm combines a plurality of energy-saving-optimized air-conditioning refrigeration system agents to jointly train the two DDPG networks and obtain a global model with strong generalization performance.

Description

Data center air conditioner energy-saving control method based on federal reinforcement learning
Technical field:
The invention relates to the field of data center artificial intelligence, and in particular to a data center air conditioner energy-saving control method based on federated reinforcement learning.
Background art:
In order to respond to climate change and achieve sustainable development, energy conservation in data centers has become a key point in building a resource-saving society. The basic energy-saving approaches for a data center air conditioning system cover three aspects: reducing the cooling/heating load, using efficient equipment and technology, and optimizing system design and control. Among these, the control optimization of the system is closely related to the operating energy consumption of the data center. According to statistics, the life cycle of a data center usually extends over decades, the operation stage has the highest energy consumption in the whole life cycle, and the energy-saving potential of this stage is huge. The air conditioning system controller dynamically adjusts set values or operation rules in response to the constantly changing indoor load of the data center, so that the energy consumption and operating cost of the air conditioning system are reduced as far as possible while the temperature required by the machines in the controlled area is met. However, existing controllers are typically designed by building a complex physical model and carefully selecting the relevant control variables. The control strategy is also generally static, that is, experts and a large amount of prior knowledge are needed to determine a fixed control strategy; moreover, the air conditioning systems of a typical data center are managed in a unified way, so individual servers with higher or lower energy consumption cannot be taken into account, and energy redundancy or shortage can occur.
With a model-free federated reinforcement learning method, the air conditioning system can, first, be controlled in a data-driven manner without establishing a complex physical model; second, the control strategy can be adjusted adaptively by combining multiple air conditioning systems according to their external environments and their own characteristics while protecting data privacy, enabling distributed management of the air conditioners in the data center, reducing energy consumption as much as possible and improving the energy-saving effect.
In the field of deep reinforcement learning, because the overlap among user features is small and training data is limited, it is difficult to obtain a high-quality model. Although transfer learning has previously succeeded in deep reinforcement learning, directly transferring data and models between parties violates privacy. A method that protects the privacy of data and models, federated deep learning, has therefore been proposed. In federated learning, when each party's local model is updated, the shared information is processed with Gaussian differential privacy, thereby protecting privacy, and the federated learning framework is evaluated along two dimensions in testing. Because user feature overlap is small, training data is limited and user data is in fact information-sensitive, it is difficult for a central information holder to establish a high-quality model; federated deep learning instead trains a classification or clustering model from the data of multiple users. The advantage of federated deep learning is that each client shares only limited information, which is encrypted when sent and decoded when received.
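As a concrete illustration of the privacy mechanism just described, the following minimal sketch (not part of the original disclosure; the clipping bound and noise level are assumptions) adds Gaussian noise to a local parameter update before it is shared, in the spirit of Gaussian differential privacy:

```python
import numpy as np

def privatize_update(params, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip a local parameter update and add Gaussian noise before sharing it.

    params: 1-D array of local model parameters (or a parameter delta).
    clip_norm, noise_std: hypothetical privacy hyper-parameters.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(params)
    clipped = params * min(1.0, clip_norm / (norm + 1e-12))   # bound the sensitivity
    return clipped + rng.normal(0.0, noise_std, size=clipped.shape)  # Gaussian noise

# usage: shared = privatize_update(np.asarray(local_weights, dtype=np.float64))
```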
From the prior art, the inventors conclude that reinforcement learning can dynamically adjust the control strategy and optimize the controlled object without relying on a large amount of prior knowledge or a complex model, and that, given enough effective data, its optimization effect can match or even surpass control methods based on expert systems. Furthermore, a data center contains many servers whose heat outputs differ and influence one another; to account for this as comprehensively as possible, the distributed learning mode of federated learning is used to cooperatively control multiple air conditioning systems, jointly optimizing the data center air conditioning systems while protecting data privacy. No such solution has been found in the prior art.
Summary of the invention:
The invention aims to provide a data center air conditioner energy-saving control method based on federated reinforcement learning, addressing the shortcomings of existing federated learning algorithms for the energy-saving optimization of data center air conditioners built on existing artificial intelligence technologies such as federated learning and reinforcement learning. The method improves the energy-saving effect of the data center air conditioners by combining reinforcement learning with federated learning.
The technical scheme of the invention is a data center air conditioner energy-saving control method based on federated reinforcement learning. The invention first applies model-free reinforcement learning to optimize the operation of a single air conditioning system and then, considering that data center servers influence one another, uses federated learning to optimize the air conditioning system of the whole data center. Combining the characteristics of reinforcement learning and federated learning both improves the energy-saving effect of a single air conditioner and optimizes the whole data center air conditioning system while protecting data privacy. The innovation is organized as two modules: the first module performs reinforcement learning on a single air conditioning system, and the second module combines multiple air conditioning systems through federated learning for energy-saving optimization control.
module 1: and carrying out energy-saving optimization on the single air conditioning system through reinforcement learning. The energy-saving optimization method comprises the following steps:
step 1: the refrigeration system is modeled with a markov decision process. A refrigerating system machine room of the data center adopts an airflow organization mode that a water chilling unit supplies air from a floor and returns air through a suspended ceiling, the operation optimization of a cold channel closed air conditioning system is an important component for energy saving of the data center, and the basic goal of the operation optimization is to reduce energy consumption as far as possible on the premise of meeting the requirements of a controlled area.
As shown in fig. 1, the following relationships exist between the components of the refrigeration system: between cooling water set and data center computer lab: chilled water is generated by the water chilling unit, is conveyed to the server air-water heat exchanger in the data center machine room through the chilled water circulating pump, is heated after passing through the server air-water heat exchanger, and carries heat back to the water chilling unit. Between the water chilling unit and the cooling tower: high-temperature cooling water is sent to a cooling tower through a cooling water circulating pump, forms cooling water with a lower temperature after wind-water heat exchange of the cooling tower and returns to a water chilling unit to complete circulation; the strategy optimization problem of the data center cold water system is modeled as a Markov decision process, and can be described as a quadruple { S, A, P, R }. Wherein S is a state space represented by a cooling area and external environment parameters; a is an action space composed of available control commands of a refrigeration system controller; p is the transition probability between different environmental states of the cold source system; r is the instantaneous return available for taking different control actions a in state S. A markov decision model for a refrigeration system is shown in figure 2.
Step 1.1 determining Markov quadruple variables
State space S: the indoor temperature, indoor relative humidity, outdoor temperature and outdoor relative humidity are selected as state space parameters. In order to train a sound decision policy, the invention considers both past and present influencing factors: the current time t is taken as the present state, and the two previous times t-1 and t-2 are taken as the past states. Action space A: the chilled water supply temperature, the chilled water supply/return pressure difference and the cooling water return temperature are selected as control actions. The transition probability P depends on the true state of the system environment after a control action is executed, and the agent makes an unbiased estimate of it through repeated Monte Carlo sampling.
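The following sketch illustrates how the state and action vectors described above could be assembled; the variable names and the 12-dimensional layout are assumptions of this illustration, not specifications from the original text:

```python
import numpy as np
from collections import deque

# One observation: [T_indoor, H_indoor, T_outdoor, H_outdoor].
# The state stacks the observations at times t-2, t-1 and t (12 values in total).
history = deque(maxlen=3)

def build_state(observation):
    """Append the newest observation and return the stacked state vector."""
    history.append(np.asarray(observation, dtype=np.float32))
    while len(history) < 3:                  # pad until three time steps exist
        history.appendleft(history[0])
    return np.concatenate(list(history))     # shape: (12,)

# Action vector (3 values): chilled water supply temperature,
# chilled water supply/return pressure difference, cooling water return temperature.
```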
The immediate reward R is shown in equation (1):

r_t = −Power(s_t, a_t) − β · Temp    (1)

where Power(s_t, a_t) is the energy consumption of the cold source system at time t (entering the reward with a negative sign so that lower energy consumption is rewarded), Temp is the penalty for the machine room temperature crossing the safety boundary, and β is the penalty coefficient. The inventors' investigation found that the failure rate of the data center is lowest when the temperature is controlled at 15-20 ℃, so this range is taken as the temperature requirement of the controlled area: Y̲ = 15 denotes the lower safe temperature limit of the controlled area and Ȳ = 20 denotes the upper safe temperature limit.

Temp = log(1 + exp(Temperature)²)    (2)

where Temperature denotes the temperature in the cabinet corresponding to the single air conditioner.
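For illustration, the reward of equations (1) and (2) could be computed as in the sketch below; treating the penalty as a function of how far the cabinet temperature leaves the 15-20 ℃ band is an interpretation by this description, since the original formula image is not reproduced in the text:

```python
import math

def immediate_reward(power, cabinet_temperature, beta=0.5,
                     t_low=15.0, t_high=20.0):
    """Immediate reward r_t of equations (1)-(2), as interpreted here.

    power: energy consumption Power(s_t, a_t) of the cold source system.
    cabinet_temperature: temperature in the cabinet served by the air conditioner.
    beta: penalty coefficient (the value is a placeholder).
    """
    # Equation (2): the penalty grows smoothly once the temperature leaves
    # the 15-20 degC safe band (the use of the excess over the band is an
    # assumption about "crossing the safety boundary").
    excess = max(t_low - cabinet_temperature, cabinet_temperature - t_high, 0.0)
    temp_penalty = math.log(1.0 + math.exp(excess) ** 2)
    # Equation (1): negative energy consumption minus the weighted penalty.
    return -power - beta * temp_penalty

# usage: r = immediate_reward(power=35.2, cabinet_temperature=22.5)
```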
Step 2: establish a database. The database receives the data collected by the various sensors and converts them, through the quadruple of step 1, into records of the form (s_t, a_t, r_t, s_{t+1}); after conversion the records are stored in the corresponding files of the database. The upper limit of the database is set to Top, and when the number of files reaches this upper limit the earliest stored data are discarded.
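A minimal sketch of the capped database of step 2, implemented as a simple replay buffer (class and parameter names are illustrative assumptions):

```python
import random
from collections import deque

class ExperienceDatabase:
    """Stores (s_t, a_t, r_t, s_{t+1}) records; the oldest is dropped when full."""

    def __init__(self, top=100_000):          # 'Top' cap from step 2 (value assumed)
        self.buffer = deque(maxlen=top)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, n):
        """Randomly draw n records for the update in step 3.4."""
        return random.sample(list(self.buffer), min(n, len(self.buffer)))
```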
Step 3: deep deterministic policy gradient (DDPG) algorithm. In DDPG reinforcement learning, the agent's policy π is represented directly in parameterized form and the mapping from environment states to actions is learned; its state value function can be expressed as:

V_π(s; ω) = E[Σ_{t=0} r(s_t, a_t); π_ω] = Σ_τ P(τ; ω) R(τ)    (3)

where ω denotes the learnable parameters, π_ω denotes the parameterized policy, r(s_t, a_t) denotes the immediate reward at the current time, τ denotes a state-action trajectory sequence, P(τ; ω) denotes the probability that the trajectory sequence τ occurs, and R(τ) denotes the cumulative return of the trajectory sequence τ.
Step 3.1: differentiate equation (3), as shown in equation (4):

∇_ω V_π(s; ω) = Σ_τ ∇_ω P(τ; ω) R(τ) = Σ_τ P(τ; ω) ∇_ω log P(τ; ω) R(τ)    (4)

After sampling n trajectories with the policy π_ω, the average experience of the n trajectories can be used to approximate the policy gradient, as shown below:

∇_ω V_π(s; ω) ≈ (1/n) Σ_{i=1}^{n} ∇_ω log P(τ_i; ω) R_i(τ)    (5)

where ∇_ω log P(τ_i; ω) is the gradient direction of the i-th trajectory and R_i(τ) represents the step size of the parameter update.
Step 3.2: define the deterministic policy gradient. In the Q-learning algorithm, the state-action value function Q represents the expected cumulative return of taking an action in a given state, as shown below:

Q_π(s, a) = E_π[Σ_{t=0} γ^t r_t | s_0 = s, a_0 = a]    (6)

where E_π denotes the expectation of the cumulative return, γ ∈ (0, 1) is the discount factor describing how the immediate reward decays with system state transitions, r_t is the immediate reward, a_0 denotes the initial action, s_0 denotes the initial state, and t denotes the system time. Replacing the probability P in equation (5) with the deterministic policy μ_ω(a|s) and replacing the trajectory return with the Q-value function yields the deterministic policy gradient formula:

∇_ω V(μ_ω) = E[∇_ω μ_ω(a|s) · ∇_a Q(s, a)|_{a=μ_ω(s)}]    (7)
and 3.3, building the DDPG network. The DDPG algorithm is realized by adopting a framework of action (Actor) and comment (Critic), wherein the Actor is responsible for giving actions in different environment states, the Critic represents the evaluation of the quality degree of corresponding actions, and the relationship between the action and the Critic is shown in FIG. 3. The Actor network and the Critic network in the invention are both built by a 3-layer neural network. Wherein the Actor network input layer comprises 200 neurons, the middle hidden layer comprises 300 neurons, and the output layer comprises 3 neurons. The Critic network has 400 neurons in the input layer, 300 neurons in the middle hidden layer and 1 neuron in the output layer. As shown in fig. 4
Step 3.4: update the parameters.
The parameter update formula is as follows:
θ_{t+1} = θ_t + α_θ [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]    (8)
ω_{t+1} = ω_t + α_ω ∇_a Q(s_t, a_t)|_{a=μ_ω(s_t)} · ∇_ω μ_ω(s_t)    (9)
where θ is the weight parameter of the Critic network, ω is the weight parameter of the Actor network, α_θ is the learning rate of the Critic network, which the invention sets to 0.005, and α_ω is the learning rate of the Actor network, which the invention sets to 0.0005.
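A hedged sketch of the parameter update of step 3.4, written as the usual temporal-difference critic update and deterministic-policy-gradient actor update; the optimizer, the discount factor and the batch handling are assumptions, and Actor/Critic refer to the illustrative classes sketched above:

```python
import torch

# One update of equations (8)-(9) on a sampled minibatch (sketch only: the
# target networks and exploration noise of full DDPG are omitted).
actor, critic = Actor(), Critic()
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=0.0005)   # alpha_omega
critic_opt = torch.optim.Adam(critic.parameters(), lr=0.005)    # alpha_theta

def ddpg_update(states, actions, rewards, next_states, gamma=0.99):
    """states/actions/next_states: float tensors; rewards: shape (batch, 1)."""
    with torch.no_grad():
        target = rewards + gamma * critic(next_states, actor(next_states))
    critic_loss = ((critic(states, actions) - target) ** 2).mean()   # TD error, eq. (8)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(states, actor(states)).mean()   # deterministic policy gradient, eq. (9)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```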
Step 4: federated averaging learning algorithm. The two DDPG networks are trained jointly across multiple agents to obtain a global model with strong generalization performance. In each communication round, every client downloads the initial model parameters of the global model, trains its local model on its own local data, and then sends the updated parameters, i.e., θ_{t+1} and ω_{t+1} of equations (8) and (9), back to the terminal server, where the local model parameters are aggregated by averaging to continuously update the global model, as shown in the following equations. Only model parameters are transmitted during the communication between the local models and the global model; the basic framework of the federated averaging learning algorithm is shown in FIG. 6.

ω_global = (1/K) Σ_{k=1}^{K} ω_k    (10)

θ_global = (1/K) Σ_{k=1}^{K} θ_k    (11)

where ω_k is the weight parameter of the Actor network of the k-th agent and θ_k is the weight parameter of the Critic network of the k-th agent.
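A minimal federated-averaging sketch for equations (10) and (11), assuming the K agents expose their Actor and Critic parameters as PyTorch state dicts (the function and attribute names are illustrative):

```python
import torch

def federated_average(state_dicts):
    """Element-wise mean of the K local parameter sets, as in equations (10)-(11)."""
    return {key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
            for key in state_dicts[0]}

def communication_round(global_actor, global_critic, agents):
    """One round: distribute the global model, train locally, average the results."""
    actor_updates, critic_updates = [], []
    for agent in agents:                                     # one agent per air conditioner
        agent.actor.load_state_dict(global_actor.state_dict())
        agent.critic.load_state_dict(global_critic.state_dict())
        agent.train_locally()                                # DDPG updates of step 3.4
        actor_updates.append(agent.actor.state_dict())
        critic_updates.append(agent.critic.state_dict())
    # Average aggregation on the terminal server (only parameters are exchanged).
    global_actor.load_state_dict(federated_average(actor_updates))
    global_critic.load_state_dict(federated_average(critic_updates))
```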
Advantageous effects: based on federated deep reinforcement learning, the control strategy can be adjusted dynamically; the energy-saving optimization problem of the data center air conditioners is solved with the federated learning algorithm, the controlled object is optimized without relying on a large amount of prior knowledge or a complex model, and with enough effective data the reinforcement learning optimization effect can match or even surpass control methods based on expert systems. The data center contains many servers whose heat outputs differ and influence one another; to account for this as comprehensively as possible, the distributed learning mode of federated learning is used to cooperatively control multiple air conditioning systems, jointly optimizing the data center air conditioning systems while protecting data privacy.
Drawings
FIG. 1 is a schematic diagram of a data center refrigeration system according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a Markov decision process model for a cold source according to an embodiment of the present invention;
FIG. 3 is a diagram of an action (Actor) and comment (Critic) framework of an embodiment of the present invention;
fig. 4 is a schematic diagram of an Actor network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Critic network structure according to an embodiment of the present invention;
FIG. 6 is a block diagram of a federated average learning algorithm framework according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Aiming at the shortcomings of existing federated learning algorithms for the energy-saving optimization of data center air conditioners built on existing artificial intelligence technologies such as federated learning and reinforcement learning, the invention provides a data center air conditioner energy-saving control method based on federated reinforcement learning. The method improves the energy-saving effect of the data center air conditioners by combining reinforcement learning and federated learning. The invention first applies model-free reinforcement learning to optimize the operation of a single air conditioning system and then, considering that data center servers influence one another, uses federated learning to optimize the air conditioning system of the whole data center. Combining the characteristics of reinforcement learning and federated learning both improves the energy-saving effect of a single air conditioner and optimizes the whole data center air conditioning system while protecting data privacy. The innovation is mainly divided into two modules: the first module performs reinforcement learning on a single air conditioning system; the second module combines multiple air conditioning systems through federated learning for energy-saving optimization control.
Module 1: and carrying out energy-saving optimization on the single air conditioning system through reinforcement learning.
Step 1: modeling a Markov decision process for a refrigeration system. The machine room of the data center adopts an airflow organization form of floor air supply and ceiling air return, the operation optimization of the cold channel closed air conditioning system is an important component part for energy saving of the data center, and the basic goal of the operation optimization is to reduce energy consumption as far as possible on the premise of meeting the requirements of a controlled area.
As shown in fig. 1, the agent K represents a terminal precision air conditioner, and the following relationship exists between the refrigeration system configurations. Between cooling water set and data center computer lab: the water chilling unit generates chilled water, the chilled water is conveyed to a data center machine room through a chilled water circulating pump, and the chilled water is heated after exchanging heat with wind water of the server and carries heat back to the water chilling unit. Between the water chilling unit and the cooling tower: high-temperature cooling water is sent to the cooling tower through a cooling water circulating pump, forms cooling water with a lower temperature after heat exchange of wind and water, and returns to the water chilling unit to complete circulation. The method models the strategy optimization problem of the data center cold water system into a Markov decision process, and can be described as a quadruple { S, A, P, R }. Wherein S is a state space represented by a cooling area and external environment parameters; a is an action space composed of available control commands of a refrigeration system controller; p is the transition probability between different environmental states of the cold source system; r is the instantaneous return available for taking different control actions a in state S. A markov decision model for a refrigeration system is shown in figure 2.
Step 1.1 determining Markov quadruple variables
State space S: the four parameters indoor temperature T_indoor, indoor relative humidity H_indoor, outdoor temperature T_outdoor and outdoor relative humidity H_outdoor are selected as state space parameters. In order to train a sound control decision, the invention considers both past and present influencing factors: the current time t is taken as the present state, and the two previous times t-1 and t-2 are taken as the past states. Action space A: the chilled water supply temperature, the chilled water supply/return pressure difference and the cooling water return temperature are selected as control actions. The transition probability P depends on the real state of the system environment after a control action is executed; the air-conditioning refrigeration system makes an unbiased estimate of it through repeated Monte Carlo sampling, and it is an algorithm parameter to be learned. The immediate reward R is responsible for evaluating the action. The operation of the air conditioning system is evaluated mainly by two indices: first, the energy consumption required by the whole system in operation, and second, guaranteeing that the system safely reaches the required temperature. The reward is shown in equation (1):

r_t = −Power(s_t, a_t) − β · Temp    (1)

where Power(s_t, a_t) is the energy consumption of the cold source system at time t, Temp is the penalty for the machine room temperature crossing the safety boundary, and β is the penalty coefficient. The inventors' investigation found that the failure rate of the data center is lowest when the temperature is controlled at 15-20 ℃, so this range is taken as the temperature requirement of the controlled area: Y̲ = 15 denotes the lower safe temperature limit of the controlled area and Ȳ = 20 denotes the upper safe temperature limit. Power(s_t, a_t), the energy consumption of the air conditioning system at time t, enters the reward with a negative sign in order to reduce the energy consumption as much as possible.

Temp = log(1 + exp(Temperature)²)    (2)

where Temperature denotes the temperature of the controlled area, i.e., the temperature of the machine room.
Step 2: establish a database. The database receives the data collected by the various sensors and converts them, through the quadruple of step 1, into records of the form (s_t, a_t, r_t, s_{t+1}); after conversion the records are stored in the corresponding files of the database. The upper limit of the database is set to Top, and when the number of files reaches this upper limit the earliest stored data are discarded.
Step 3: in the deep deterministic policy gradient (DDPG) algorithm, the air conditioning system represents the control strategy π directly in parameterized form and learns the mapping from environment states to actions; its state value function can be expressed as:

V_π(s; ω) = E[Σ_{t=0} r(s_t, a_t); π_ω] = Σ_τ P(τ; ω) R(τ)    (3)

where ω denotes the learnable parameters, π_ω denotes the parameterized policy, r(s_t, a_t) denotes the immediate reward at the current time, τ denotes a state-action trajectory sequence, P(τ; ω) denotes the probability that the trajectory sequence τ occurs, and R(τ) denotes the cumulative return of the trajectory sequence τ.
Step 3.1: differentiate equation (3), as shown in equation (4):

∇_ω V_π(s; ω) = Σ_τ ∇_ω P(τ; ω) R(τ) = Σ_τ P(τ; ω) ∇_ω log P(τ; ω) R(τ)    (4)

After sampling n trajectories with the policy π_ω, the average experience of the n trajectories can be used to approximate the policy gradient, as shown below:

∇_ω V_π(s; ω) ≈ (1/n) Σ_{i=1}^{n} ∇_ω log P(τ_i; ω) R_i(τ)    (5)

where ∇_ω log P(τ_i; ω) is the gradient direction of the i-th trajectory and R_i(τ) represents the step size of the parameter update.
Step 3.2: define the deterministic policy gradient. In the Q-learning algorithm, the state-action value function Q represents the expected cumulative return of taking an action in a given state, as shown below:

Q_π(s, a) = E_π[Σ_{t=0} γ^t r_t | s_0 = s, a_0 = a]    (6)

where E_π denotes the expectation of the cumulative return, γ ∈ (0, 1) is the discount factor representing the degree of decay of the immediate reward with system state transitions, r_t is the immediate reward, a_0 denotes the initial action, s_0 denotes the initial state, and t denotes the system time. Replacing the probability P in equation (5) with the deterministic policy μ_ω(a|s) and replacing the trajectory return with the Q-value function yields the deterministic policy gradient formula:

∇_ω V(μ_ω) = E[∇_ω μ_ω(a|s) · ∇_a Q(s, a)|_{a=μ_ω(s)}]    (7)
Step 3.3: build the DDPG networks. The DDPG algorithm is implemented with an Actor-Critic framework: the Actor is responsible for giving the action in different environment states, and the Critic evaluates how good the corresponding action is; their relationship is shown in FIG. 3. Both the Actor network and the Critic network in the invention are built as 3-layer neural networks. The Actor network input layer contains 200 neurons, the middle hidden layer 300 neurons and the output layer 3 neurons; the Critic network has 400 neurons in the input layer, 300 neurons in the middle hidden layer and 1 neuron in the output layer, as shown in FIG. 4 and FIG. 5.
Step 3.4: update the parameters. After a period of data collection, N records are randomly drawn from the database and each neural network parameter is updated by gradient descent, so as to improve the control strategy of the air conditioning system.
The parameter update formula is as follows:
θ_{t+1} = θ_t + α_θ [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]    (8)
ω_{t+1} = ω_t + α_ω ∇_a Q(s_t, a_t)|_{a=μ_ω(s_t)} · ∇_ω μ_ω(s_t)    (9)
where θ is the weight parameter of the Critic network, ω is the weight parameter of the Actor network, α_θ is the learning rate of the Critic network, which the invention sets to 0.005, and α_ω is the learning rate of the Actor network, which the invention sets to 0.0005.
Step 4: federated averaging learning algorithm. Multiple air conditioning systems are combined for joint training to obtain two global DDPG network models with strong generalization performance, namely an Actor network and a Critic network. In each communication round, every client downloads the initial model parameters from the global model, trains its local model on its own local data, and then sends the updated parameters, i.e., θ_{t+1} and ω_{t+1} of equations (8) and (9), back to the terminal server, where the local model parameters are aggregated by averaging to continuously update the global model, as shown in the following equations. Only model parameters are transmitted during the communication between the local models and the global model; the basic framework of the federated averaging learning algorithm is shown in FIG. 6.
ω_global = (1/K) Σ_{k=1}^{K} ω_k    (10)

θ_global = (1/K) Σ_{k=1}^{K} θ_k    (11)

where ω_k is the weight parameter of the Actor network of the k-th agent and θ_k is the weight parameter of the Critic network of the k-th agent.
The overall flow chart of the invention is shown in fig. 7.
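Putting the two modules together, the overall flow of FIG. 7 can be summarized by the following sketch; the agent class, the number of agents and the number of communication rounds are illustrative assumptions, and Actor, Critic, ExperienceDatabase and communication_round refer to the sketches given earlier:

```python
class AirConditionerAgent:
    """One terminal air conditioner: its own DDPG networks and experience database."""
    def __init__(self):
        self.actor, self.critic = Actor(), Critic()
        self.database = ExperienceDatabase()

    def train_locally(self, steps=1000):
        # Collect (s_t, a_t, r_t, s_{t+1}) records from the sensors into
        # self.database (step 2) and run `steps` updates of equations (8)-(9)
        # (step 3.4); omitted here for brevity.
        pass

agents = [AirConditionerAgent() for _ in range(8)]     # K = 8 agents is an assumed count
global_actor, global_critic = Actor(), Critic()

for round_index in range(50):                          # number of rounds is assumed
    communication_round(global_actor, global_critic, agents)   # module 2, eqs. (10)-(11)
```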

Claims (2)

1. A data center air conditioner energy-saving control method based on federated reinforcement learning, characterized by comprising the following steps:
based on two modules: the first module performs reinforcement learning on a single air-conditioning system for energy-saving optimization, and the second module performs energy-saving optimization control by combining a plurality of air-conditioning refrigeration systems through federated learning;
module 1: a module for performing energy-saving optimization on a single air conditioning system through reinforcement learning; the energy-saving optimization comprises the following steps:
step 1: modeling a single air-conditioning refrigeration system as a Markov decision process; the refrigeration-system machine room of the data center adopts an airflow organization in which the water chilling unit supplies air through the raised floor and returns air through the suspended ceiling, the operation optimization of the cold-aisle-containment air conditioning system is an important component of data center energy saving, and the basic aim of the operation optimization is to reduce energy consumption as far as possible on the premise of meeting the requirements of the controlled area;
the following relationships exist between the components of the refrigeration system: between the water chilling unit and the data center machine room: chilled water is generated by the water chilling unit, is conveyed by the chilled water circulating pump to the server air-water heat exchangers in the data center machine room, is heated after passing through the server air-water heat exchangers, and carries the heat back to the water chilling unit; between the water chilling unit and the cooling tower: high-temperature cooling water is sent to the cooling tower by the cooling water circulating pump, becomes lower-temperature cooling water after the air-water heat exchange of the cooling tower and returns to the water chilling unit to complete the cycle; the strategy optimization problem of the data center chilled water system is modeled as a Markov decision process described as a quadruple {S, A, P, R}: wherein S is the state space represented by the cooling area and the external environment parameters; A is the action space composed of the control commands available to the refrigeration system controller; P is the transition probability between different environmental states of the cold source system; R is the immediate reward that can be obtained by taking different control actions A in the state S;
step 1.1 determining Markov quadruple variables
Selecting indoor temperature, indoor relative humidity, outdoor temperature and outdoor relative humidity as state space parameters; taking the current time t as the current state, and taking the previous two times t-1 and t-2 as the past states; the action space A selects chilled water supply temperature, chilled water supply and return water pressure difference and cooling water return water temperature as control actions; the transition probability P depends on the real state of the system environment after the control action is executed, and unbiased estimation needs to be made by means of multiple Monte Carlo sampling alignment; the immediate reward R is shown in formula (1):
r_t = −Power(s_t, a_t) − β · Temp    (1)

wherein Power(s_t, a_t) is the energy consumption of the cold source system at time t, Temp is the penalty for the machine room temperature crossing the safety boundary, and β is the penalty coefficient; the temperature is controlled at 15-20 ℃ as the temperature requirement of the controlled area, Y̲ = 15 representing the lower safe temperature limit of the controlled area and Ȳ = 20 representing the upper safe temperature limit of the controlled area;

Temp = log(1 + exp(Temperature)²)    (2)

wherein Temperature represents the temperature in the cabinet corresponding to the single air conditioner;
step 2, establishing a database, wherein the database receives data acquired by various sensors and converts the data, through the quadruple of step 1, into records of the form (s_t, a_t, r_t, s_{t+1}); the records are stored in the corresponding files of the database after conversion, the upper limit of the number of files of the database is set to Top, and the earliest stored data are discarded when the number of files reaches the upper limit;
step 3, deep deterministic policy gradient (DDPG) algorithm:
in reinforcement learning, the strategy π of the agent is represented directly in parameterized form and the mapping between environment and action is learned; its state value function is expressed as:

V_π(s; ω) = E[Σ_{t=0} r(s_t, a_t); π_ω] = Σ_τ P(τ; ω) R(τ)    (3)

wherein ω denotes the learnable parameters, π_ω denotes the parameterized policy, r(s_t, a_t) denotes the immediate reward at the current time, τ denotes a state-action trajectory sequence, P(τ; ω) denotes the probability that the trajectory sequence τ occurs, and R(τ) denotes the cumulative return of the trajectory sequence τ;
step 3.1, differentiating equation (3), as shown in equation (4):

∇_ω V_π(s; ω) = Σ_τ ∇_ω P(τ; ω) R(τ) = Σ_τ P(τ; ω) ∇_ω log P(τ; ω) R(τ)    (4)

when the parameterized strategy π_ω is used to sample n trajectories, the average experience of the n trajectories can be used to approximate the policy gradient, as shown below;

∇_ω V_π(s; ω) ≈ (1/n) Σ_{i=1}^{n} ∇_ω log P(τ_i; ω) R_i(τ)    (5)

wherein ∇_ω log P(τ_i; ω) is the gradient direction of the i-th trajectory and R_i(τ) represents the step size of the parameter update;
step 3.2, defining a deterministic policy gradient; in the Q-learning algorithm, the state-action value function Q represents the expected value of the accumulated return at the state-action pair, as shown below;

Q_π(s, a) = E_π[Σ_{t=0} γ^t r_t | s_0 = s, a_0 = a]    (6)

wherein E_π represents the cumulative return expectation, γ ∈ (0, 1) is the discount factor representing the degree of decay of the immediate reward with system state transitions, r_t is the immediate reward, a_0 represents the initial action, s_0 represents the initial state, and t represents the system time; replacing the probability P in formula (5) with the deterministic strategy μ_ω(a|s) and replacing the trajectory return with the Q-value function yields the deterministic policy gradient calculation formula:

∇_ω V(μ_ω) = E[∇_ω μ_ω(a|s) · ∇_a Q(s, a)|_{a=μ_ω(s)}]    (7)

step 3.3, building the DDPG networks; the DDPG algorithm is implemented with an Actor-Critic framework, wherein the Actor is responsible for giving the action in different environment states and the Critic expresses how good the corresponding action is; the Actor network and the Critic network are both neural networks, the Actor network output layer comprises 3 neurons, and the Critic network output layer comprises 1 neuron;
step 3.4, updating the parameters;
the parameter update formulas are as follows:

θ_{t+1} = θ_t + α_θ [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]    (8)

ω_{t+1} = ω_t + α_ω ∇_a Q(s_t, a_t)|_{a=μ_ω(s_t)} · ∇_ω μ_ω(s_t)    (9)

wherein θ is the weight parameter of the Critic network, ω is the weight parameter of the Actor network, α_θ is the learning rate of the Critic network, set to 0.005, and α_ω is the learning rate of the Actor network, set to 0.0005;
the energy-saving optimization control of the plurality of air-conditioning refrigeration systems adopts the second module, in which the plurality of air-conditioning refrigeration systems are combined through federated learning for energy-saving optimization control:
step 4, federated averaging learning algorithm; the plurality of energy-saving-optimized air-conditioning refrigeration system agents are combined to jointly train the two DDPG networks to obtain a global model with strong generalization performance; the client downloads the initial model parameters from the global model in each communication round, trains the local model on its respective local data, and then sends the updated parameters, namely the parameters θ_{t+1} and ω_{t+1} of formula (8) and formula (9), back to the terminal server, and the local model parameters are aggregated by averaging on the terminal server to continuously update the global model, as shown in the following equations; only model parameters are transmitted during the communication between the local model and the global model;

ω_global = (1/K) Σ_{k=1}^{K} ω_k    (10)

θ_global = (1/K) Σ_{k=1}^{K} θ_k    (11)

wherein ω_k is the weight parameter of the Actor network of the k-th air conditioning system and θ_k is the weight parameter of the Critic network of the k-th agent.
2. The data center air conditioner energy-saving control method based on federated reinforcement learning according to claim 1, wherein the Actor network and the Critic network are both constructed as 3-layer neural networks; the Actor network input layer comprises 200 neurons, the middle hidden layer comprises 300 neurons, and the output layer comprises 3 neurons; the Critic network has 400 neurons in the input layer, 300 neurons in the middle hidden layer and 1 neuron in the output layer.
CN202110812270.2A 2021-07-19 2021-07-19 Data center air conditioner energy-saving control method based on federal reinforcement learning Withdrawn CN113551373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110812270.2A CN113551373A (en) 2021-07-19 2021-07-19 Data center air conditioner energy-saving control method based on federal reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110812270.2A CN113551373A (en) 2021-07-19 2021-07-19 Data center air conditioner energy-saving control method based on federal reinforcement learning

Publications (1)

Publication Number Publication Date
CN113551373A true CN113551373A (en) 2021-10-26

Family

ID=78103352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110812270.2A Withdrawn CN113551373A (en) 2021-07-19 2021-07-19 Data center air conditioner energy-saving control method based on federal reinforcement learning

Country Status (1)

Country Link
CN (1) CN113551373A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114017904A (en) * 2021-11-04 2022-02-08 广东电网有限责任公司 Operation control method and device for building HVAC system
CN114330852A (en) * 2021-12-21 2022-04-12 清华大学 Energy-saving optimization method and device for tail end air conditioning system of integrated data center cabinet
CN114901057A (en) * 2022-07-12 2022-08-12 联通(广东)产业互联网有限公司 Multi-point energy consumption detection and dynamic regulation system in data center machine room
WO2023093388A1 (en) * 2021-11-26 2023-06-01 深圳市愚公科技有限公司 Air purifier adjusting method based on reinforcement learning model, and air purifier
CN116294089A (en) * 2023-05-23 2023-06-23 浙江之科云创数字科技有限公司 Air conditioning system control method and device, storage medium and electronic equipment

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114017904A (en) * 2021-11-04 2022-02-08 广东电网有限责任公司 Operation control method and device for building HVAC system
CN114017904B (en) * 2021-11-04 2023-01-20 广东电网有限责任公司 Operation control method and device for building HVAC system
WO2023093388A1 (en) * 2021-11-26 2023-06-01 深圳市愚公科技有限公司 Air purifier adjusting method based on reinforcement learning model, and air purifier
CN114330852A (en) * 2021-12-21 2022-04-12 清华大学 Energy-saving optimization method and device for tail end air conditioning system of integrated data center cabinet
CN114330852B (en) * 2021-12-21 2022-09-23 清华大学 Energy-saving optimization method and device for tail end air conditioning system of integrated data center cabinet
WO2023116742A1 (en) * 2021-12-21 2023-06-29 清华大学 Energy-saving optimization method and apparatus for terminal air conditioning system of integrated data center cabinet
CN114901057A (en) * 2022-07-12 2022-08-12 联通(广东)产业互联网有限公司 Multi-point energy consumption detection and dynamic regulation system in data center machine room
CN114901057B (en) * 2022-07-12 2022-09-27 联通(广东)产业互联网有限公司 Multi-point energy consumption detection and dynamic regulation system in data center machine room
CN116294089A (en) * 2023-05-23 2023-06-23 浙江之科云创数字科技有限公司 Air conditioning system control method and device, storage medium and electronic equipment
CN116294089B (en) * 2023-05-23 2023-08-18 浙江之科云创数字科技有限公司 Air conditioning system control method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN113551373A (en) Data center air conditioner energy-saving control method based on federal reinforcement learning
CN102301288B (en) Systems and methods to control energy consumption efficiency
WO2023093820A1 (en) Device control optimization method, display platform, cloud server, and storage medium
CN113283156B (en) Energy-saving control method for subway station air conditioning system based on deep reinforcement learning
CN112415924A (en) Energy-saving optimization method and system for air conditioning system
CN114383299B (en) Central air-conditioning system operation strategy optimization method based on big data and dynamic simulation
CN112598150B (en) Method for improving fire detection effect based on federal learning in intelligent power plant
CN113039506B (en) Causal learning-based data center foundation structure optimization method
CN112628956B (en) Water chilling unit load prediction control method and system based on edge cloud cooperative framework
CN115220351B (en) Intelligent energy-saving optimization control method for building air conditioning system based on cloud side end
Lissa et al. Transfer learning applied to reinforcement learning-based hvac control
Yu et al. District cooling system control for providing operating reserve based on safe deep reinforcement learning
CN111649457A (en) Dynamic predictive machine learning type air conditioner energy-saving control method
Jiang et al. Deep transfer learning for thermal dynamics modeling in smart buildings
Wang et al. Toward physics-guided safe deep reinforcement learning for green data center cooling control
CN114326987B (en) Refrigerating system control and model training method, device, equipment and storage medium
Feng et al. A fully distributed voting strategy for AHU fault detection and diagnosis based on a decentralized structure
CN114970358A (en) Data center energy efficiency optimization method and system based on reinforcement learning
Sun et al. Energy consumption optimization of building air conditioning system via combining the parallel temporal convolutional neural network and adaptive opposition-learning chimp algorithm
Deng et al. Toward smart multizone HVAC control by combining context-aware system and deep reinforcement learning
CN116907036A (en) Deep reinforcement learning water chilling unit control method based on cold load prediction
Groumpos et al. New advanced technology methods for energy efficiency of buildings
Burger et al. ARX model of a residential heating system with backpropagation parameter estimation algorithm
Lin et al. Optimizing for Large Time Delay Systems by BP Neural Network and Evolutionary Algorithm Improving.
CN110595008A (en) Multi-equipment collaborative optimization method and system for ground source heat pump air conditioning system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211026