Background
The central air-conditioning (HVAC) system is one of the main contributors to building energy consumption. It is characterized by long operating hours, high power, and a flexible temperature regulation range, and is therefore a demand-side resource with great potential. Because the building environment has thermal storage capacity, air-conditioning load regulation exhibits an energy-storage characteristic that conventional loads lack, which makes the central air-conditioning system one of the most promising targets for implementing demand response to reduce load during power peaks. In order to adapt to continuously changing outdoor weather and varying indoor loads, how to select an appropriate controller to regulate the central air-conditioning system reasonably, guaranteeing user comfort while reducing the building load during peak hours, has long been a research focus of building operation optimization.
Currently, the control methods of the central air conditioning system include:
1) Conventional control modes, including rule-based control (such as start-stop control) and PID control. Conventional approaches determine supervisory-level set points for the central air-conditioning system, such as various temperature and flow-rate set points, using rule-based methods that are typically static and derived from the experience of engineers and facility managers; they therefore require extensive a priori knowledge and accurate system model parameters. Because of their simple design and low cost, such methods are widely used in practical engineering projects. However, a central air-conditioning system is a typical complex multivariable system with strong nonlinearity, coupling, time-varying behavior and uncertainty, so conventional control often fails to achieve satisfactory operation.
2) Model predictive control (MPC), whose basic idea is to obtain an optimal control strategy at each time step by performing rolling optimization over a future time window. By predicting future indoor disturbances and outdoor weather conditions, building energy efficiency can be significantly improved. However, the actual performance of MPC depends heavily on the accuracy of the model. For building thermal and humidity control in particular, it is difficult to establish a dynamic building model that is both accurate and usable in real-time optimization, and once the mathematical model deviates substantially from reality, the effectiveness of the control strategy computed by MPC cannot be guaranteed. Moreover, MPC requires low-order system dynamics and objective functions: developing the MPC "model" is complex, and linear models are typically used to describe the building temperature response, so the control variables must be chosen carefully to preserve a low-order relationship between air-conditioning energy consumption and the state and control variables.
3) Heuristic algorithms (such as genetic algorithms and particle swarm optimization). A genetic algorithm can be used to optimize the energy-saving operation of the central air-conditioning system, but this approach requires building a black-box model, and the associated mechanism modeling and parameter identification work is complex.
4) Reinforcement learning methods, which mainly use the traditional tabular Q-learning algorithm to optimize air-conditioning operation. In practical control problems, however, the state space and action space are high-dimensional, so the algorithm suffers from the curse of dimensionality. Thanks to the strong generalization ability of neural networks, the curse of dimensionality can be alleviated by approximating the value function with a parameterized network, but a single-network structure tends to over-estimate the value function during learning. Building on reinforcement learning, an LSTM network can mitigate the vanishing-gradient problem and improve the stability of the algorithm, yet the over-estimation of the value function remains unresolved.
The problems of existing central air-conditioning control methods can be summarized as difficulty of modeling or inaccuracy of the model. It is therefore necessary to provide a control method for the central air-conditioning system that does not depend on accurate modeling.
Disclosure of Invention
The invention provides a central air-conditioning control method based on reinforcement learning, which solves for control actions with the Deep Deterministic Policy Gradient (DDPG) method, is not affected by model parameters, and increases the air-conditioning load regulation capability while guaranteeing user comfort.
To achieve the above and other related objects, the present invention provides a central air conditioning system control method based on reinforcement learning, comprising the steps of:
S1, designing a state space S of the central air-conditioning system, a control action A for controlling the central air-conditioning system, and a reward function r_t;
the state space S at least comprises the air-conditioning load, the temperature of the controlled area (subject to weather and disturbance factors), the outdoor weather condition, the chilled-water supply temperature, the advance cooling time, the demand response duration, the operating state of the chiller, and a time series;
the control action A is either shutting down the central air-conditioning system or selecting a supply-water temperature from a supply-water temperature set as the supply-water temperature of the central air-conditioning system, and the control action A is selected based on the state space S;
the reward function r_t is used for evaluating the control result produced by the control action A to obtain a reward value;
S2, designing a DDPG network based on the state space S, the control action A and the reward function r_t;
and S3, executing the DDPG network to control the central air-conditioning system.
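At a high level, steps S1-S3 amount to a supervisory loop that, at each decision interval, assembles the state, queries the trained policy, and applies the chosen action. The following minimal Python sketch illustrates this loop under stated assumptions: the candidate supply temperatures, the number of decision steps, and all function names (build_state, select_action, apply_action) are illustrative placeholders rather than part of the claimed method, and the policy is stubbed with a random choice where a trained DDPG actor would be used.

```python
# Minimal sketch of the S1-S3 supervisory loop; names and values are illustrative assumptions.
import random

SUPPLY_TEMPS = [7.0, 8.0, 9.0, 10.0]  # assumed candidate chilled-water supply temperatures (deg C)

def build_state():
    """S1: assemble the state vector [P_hvac, T_in, T_out, T_supply, T_p, T_e, s_it, S_m]."""
    return [random.random() for _ in range(8)]  # placeholder sensor readings

def select_action(state):
    """S2/S3: a trained DDPG actor would map the state to an action; stubbed here."""
    return random.randrange(len(SUPPLY_TEMPS) + 1)  # 0 = shut down, 1..n = supply-temperature index

def apply_action(action):
    if action == 0:
        print("shut down the central air-conditioning system")
    else:
        print(f"set chilled-water supply temperature to {SUPPLY_TEMPS[action - 1]} deg C")

for _ in range(3):  # in deployment, one decision per 10-minute interval of the time series S_m
    apply_action(select_action(build_state()))
```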
Preferably, the formula of the reward function r_t is:
r_t = −[η × (T_setlow − T_ave) × λ + β × P_hvac]
wherein η, λ and β are adjustable hyper-parameters; η and β control the relative importance, in the optimization, of building air-conditioning energy consumption versus indoor thermal comfort; λ is the penalty level for the room temperature violating the controlled-area temperature requirement during idle time; T_setlow is the penalty threshold of the indoor air temperature; and T_ave is the average indoor temperature. All parameters are normalized.
Preferably, the implementation method of the DDPG network includes the following steps:
S3.1, randomly initializing the current critic network Q, the current actor network μ, and their target networks Q' and μ', randomly initializing the replay buffer R, and initializing the random exploration noise N;
S3.2, setting an initial state s_t based on the state space S, and inputting the initial state s_t into the current actor network μ to obtain an initial action a_t;
S3.3, executing the initial action a_t, receiving an initial reward R_t according to the reward function r_t, entering the next state s_{t+1}, and storing the tuple [s_t, a_t, R_t, s_{t+1}] in the replay buffer R;
S3.4, randomly sampling m tuples [s_t, a_t, R_t, s_{t+1}] from the replay buffer R, where t = 1, 2, ..., m and m ≥ 2, and computing the target value y_t using the target network Q' of the current critic network Q based on the m sampled tuples;
S3.5, substituting the target value y_t into the loss function of the current critic network Q to update the current critic network Q, and updating the current actor network μ by gradient back-propagation;
S3.6, proportionally updating the target networks Q' and μ'.
Preferably, the loss function of the current critic network Q is:
L(θ^Q) = (1/m) Σ_{t=1}^{m} (y_t − Q(s_t, a_t | θ^Q))²
wherein Q(s_t, a_t | θ^Q) denotes substituting s_t and a_t into the current critic network Q with network parameters θ^Q, t = 1, 2, ..., m, m ≥ 2.
Preferably, the specific formula for updating the current actor network μ by gradient back-propagation is:
∇_{θ^μ} J ≈ (1/m) Σ_{t=1}^{m} ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t}
wherein θ^μ denotes the network parameters of the current actor network μ.
Preferably, proportionally updating the target networks Q' and μ' means updating their network parameters θ^{Q'} and θ^{μ'} according to the following formulas:
θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
θ^{μ'} ← τ θ^μ + (1 − τ) θ^{μ'}
where τ represents an update coefficient.
Preferably, the time series S_m is defined as a pair of indices: one representing the seven-day weekly sequence (the day of the week, 1 to 7), and one representing the 144 ten-minute intervals of each day (1 to 144).
Based on the same inventive concept, the present invention also provides a control system of a central air-conditioning system, which includes an ARM-based embedded device on which a program implementing the reinforcement-learning-based central air-conditioning system control method described in any one of the above is deployed, so that the embedded device can shut down the air conditioner and regulate its temperature.
In conclusion, the reinforcement-learning-based central air-conditioning control method and control system provided by the invention can reduce the building load during peak hours, without affecting user comfort, through a series of technical means for regulating the building air-conditioning system. The method has strong convergence capability and good stability, and improves system efficiency through continuous learning.
Detailed Description
The reinforcement-learning-based central air-conditioning control method and system are described in detail below with reference to the accompanying drawings and specific embodiments. The advantages and features of the present invention will become more apparent from the following description. It should be noted that the drawings are in a highly simplified form and are not drawn to precise scale; they serve only to facilitate and clarify the description of the embodiments of the present invention. The structures, ratios and sizes shown in the drawings and described in the specification are only used to match the disclosure so that those skilled in the art can understand and read it, and are not intended to limit the conditions under which the invention can be implemented; any structural modification, change of ratio or adjustment of size that does not affect the efficacy or achievable purpose of the invention shall still fall within the scope of the invention.
First, the DDPG mentioned in the present invention is explained. DDPG evolved from DDQN. In DDQN, the current Q network is responsible for computing the Q values of the executable actions in the current state space S; an action A is then selected using an ε-greedy strategy, executing A yields a new state space S' and a reward, and the sample is stored in an experience pool, i.e. a replay buffer. For the next state space S' sampled from the replay buffer, the executable actions are evaluated and an action A' is selected with a greedy strategy; the target Q network then computes the target Q value, after which the loss function is evaluated and the parameters are updated by gradient back-propagation. The target Q network computes the target Q values of the experience-pool samples jointly with the current Q network, following the idea of decoupling action selection from value evaluation, and periodically copies its parameters from the current Q network. In DDPG, the roles of the current Critic network and the target Critic network are basically analogous to the current and target Q networks of DDQN. However, DDPG has its own Actor policy network, so the ε-greedy strategy is not needed: action A is selected by the current Actor network, and for the next state space S' sampled from the experience pool, action A' is selected by the target Actor network without a greedy step. DDPG incorporates the concepts of experience replay and dual networks, i.e. current and target networks. Because there are both an Actor network and a Critic network, there are four networks in total: the current Actor network, the target Actor network, the current Critic network, and the target Critic network. The two Actor networks have the same structure, and the two Critic networks have the same structure.
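As a minimal illustration of this four-network structure (a sketch under assumed layer sizes, not the patented implementation), the following PyTorch code builds a current Actor, a current Critic, and their target copies; the two Actor networks share one structure and the two Critic networks share another, as described above.

```python
# Sketch of DDPG's four networks (assumed layer sizes; PyTorch).
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Current actor network mu: maps a state to a deterministic action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Current critic network Q: scores a (state, action) pair."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

state_dim, action_dim = 8, 1                  # assumed dimensions for illustration only
actor, critic = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
actor_target = copy.deepcopy(actor)           # target actor network mu'
critic_target = copy.deepcopy(critic)         # target critic network Q'
```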
Referring to fig. 1, the present invention provides a central air conditioning system control method based on reinforcement learning, including the following steps:
S1, designing a state space S of the central air-conditioning system, a control action A for controlling the central air-conditioning system, and a reward function r_t. The state space S at least comprises the air-conditioning load, the temperature of the controlled area (subject to weather and disturbance factors), the outdoor weather condition, the chilled-water supply temperature, the advance cooling time, the demand response duration, the operating state of the chiller, and a time series. The control action A is either shutting down the central air-conditioning system or selecting a supply-water temperature from a supply-water temperature set as the supply-water temperature of the central air-conditioning system, and the control action A is selected based on the state space S. The reward function r_t is used for evaluating the control result produced by the control action A to obtain a reward value.
S2, designing a DDPG network based on the state space S, the control action A and the reward function r_t;
and S3, executing the DDPG network to control the central air-conditioning system.
In this embodiment, for step S1, the present invention formulates the demand response problem of the central air-conditioning system as a Markov decision process, determines the observable state space and control information, and designs a reward function to accelerate the optimization process of the agent.
First, the state space is designed, i.e. the state space S of the central air-conditioning system and of the space it serves, where S = [P_hvac, T_in, T_out, T_supply, T_p, T_e, s_i,t, S_m]. Here P_hvac represents the air-conditioning load, which is influenced by the control strategy's actions; T_in is the temperature of the controlled area, subject to weather and disturbance factors; T_out is the outdoor weather condition; T_supply is the chilled-water supply temperature; T_p is the advance cooling time; T_e is the demand response duration; s_i,t is the operating state of the chiller; and the time series S_m is defined as a pair of indices, one representing the seven-day weekly sequence (the day of the week, 1 to 7) and one representing the 144 ten-minute intervals of each day (1 to 144).
Then, the control action is designed, denoted by A, with A = [off, a_1, a_2, ..., a_n], where off is the shutdown state and a_i (i = 1, 2, ..., n) represents the different supply-water temperature values of the building at different times.
Finally, the reward function is designed, denoted by r_t and defined as r_t = −[η × (T_setlow − T_ave) × λ + β × P_hvac], where η, λ and β are adjustable hyper-parameters; η and β control the relative importance, in the optimization, of building air-conditioning energy consumption versus indoor thermal comfort; λ is the penalty level for a temperature violation in the controlled area during idle time; T_setlow is the penalty threshold of the indoor air temperature; and T_ave is the average indoor temperature. All parameters are normalized.
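To make the reward design concrete, the short Python sketch below evaluates r_t for a single time step. Only the formula comes from the text above; the numerical values of η, λ, β, T_setlow and the sample inputs are illustrative assumptions, and in practice all quantities would be normalized as stated.

```python
# Illustrative evaluation of r_t = -[eta * (T_setlow - T_ave) * lam + beta * P_hvac]
# (all numeric values below are assumptions, not values disclosed by the invention).
def reward(t_ave, p_hvac, t_setlow=0.5, eta=1.0, lam=1.0, beta=1.0):
    comfort_term = eta * (t_setlow - t_ave) * lam  # temperature-violation penalty term
    energy_term = beta * p_hvac                    # air-conditioning energy-consumption term
    return -(comfort_term + energy_term)

print(reward(t_ave=0.6, p_hvac=0.3))  # example: normalized room temperature 0.6, normalized load 0.3
```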
In this embodiment, the DDPG network is trained according to the following steps:
S3.1, randomly initializing the current critic network Q, the current actor network μ, and their target networks Q' and μ', and randomly initializing the replay buffer R;
S3.2, setting an initial state s_t based on the state space S, and inputting the initial state s_t into the current actor network μ to obtain an initial action a_t = μ(s_t | θ^μ) + N, where N denotes random noise; in order to add some randomness to the learning process and broaden its exploration coverage, DDPG adds a certain noise N to the selected action;
S3.3, executing the initial action a_t, receiving an initial reward R_t according to the reward function r_t, entering the next state s_{t+1}, and storing the tuple [s_t, a_t, R_t, s_{t+1}] in the replay buffer R;
S3.4, randomly sampling m tuples [s_t, a_t, R_t, s_{t+1}] from the replay buffer R, where t = 1, 2, ..., m and m ≥ 2, and computing the target value y_t using the target network Q' of the current critic network Q based on the m sampled tuples;
S3.5, substituting the target value y_t into the loss function of the current critic network Q to update the current critic network Q, and then updating the current actor network μ by gradient back-propagation;
S3.6, proportionally updating the target networks Q' and μ'.
The basic idea is to use convolutional neural networks, namely the μ network and the Q network, as approximations of the policy function, and then to train these networks with deep learning. This is a deterministic policy method: at each step the action is obtained directly as a definite value from the policy function, while deep learning continuously optimizes the networks and thereby improves the policy.
In this embodiment, the loss function of the current critic network Q is:
L(θ^Q) = (1/m) Σ_{t=1}^{m} (y_t − Q(s_t, a_t | θ^Q))²
wherein Q(s_t, a_t | θ^Q) denotes substituting s_t and a_t into the current critic network Q with network parameters θ^Q, t = 1, 2, ..., m, m ≥ 2.
In this embodiment, the specific formula for updating the current actor network μ by gradient back-propagation is:
∇_{θ^μ} J ≈ (1/m) Σ_{t=1}^{m} ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t}
wherein θ^μ denotes the network parameters of the current actor network μ.
In this embodiment, proportionally updating the target networks Q' and μ' means updating their network parameters θ^{Q'} and θ^{μ'} according to the following formulas:
θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
θ^{μ'} ← τ θ^μ + (1 − τ) θ^{μ'}
where τ represents an update coefficient, which is generally taken to be relatively small, such as a value of 0.1 or 0.01.
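Continuing the PyTorch sketch above, the three update rules — the critic loss, the policy-gradient update of the actor, and the proportional (soft) target update — can be written for one sampled minibatch as follows. This is a generic DDPG update under assumed tensor shapes, reusing the actor/critic objects from the earlier sketch rather than a verbatim excerpt of the claimed method; the target value y_t = R_t + γ Q'(s_{t+1}, μ'(s_{t+1})) follows standard DDPG.

```python
# Generic DDPG update for one minibatch of m samples (assumed tensor shapes:
# s, s_next: [m, state_dim]; a: [m, action_dim]; r: [m, 1]); PyTorch.
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.001):
    s, a, r, s_next = batch

    # Target value y_t computed with the target networks Q' and mu'.
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Critic update: minimize (1/m) * sum of (y_t - Q(s_t, a_t | theta_Q))^2.
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: gradient ascent on Q(s, mu(s)), i.e. back-propagation through the critic.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (proportional) target update: theta' <- tau * theta + (1 - tau) * theta'.
    for target, current in ((critic_target, critic), (actor_target, actor)):
        for p_t, p in zip(target.parameters(), current.parameters()):
            p_t.data.copy_(tau * p.data + (1 - tau) * p_t.data)
```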
In addition, the inventors tested the method. Over the whole continuous demand response process, pre-cooling is carried out a certain time in advance according to the learned strategy, and the start time, the duration and the pre-cooling temperature are all learned by the agent. The duration of the whole demand response is learned autonomously by the agent according to its sensitivity to changes in the outdoor temperature difference; after the demand response ends, the unit operates normally according to the outdoor temperature. In the control process of verifying demand response of the central air-conditioning system with reinforcement learning, the parameters need to be configured. The example system parameters were set as follows: actor network learning rate 0.001, critic network learning rate 0.0001, discount factor 0.99, and target network update parameter 0.001.
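Under those reported settings, and reusing the actor and critic objects from the sketches above, the optimizer configuration could look like the following; the choice of the Adam optimizer is an assumption, and only the numeric values come from the text above.

```python
# Sketch of the reported training hyper-parameters (optimizer choice is an assumption).
import torch

GAMMA = 0.99  # discount factor reported above
TAU = 0.001   # target-network update parameter reported above

actor_opt = torch.optim.Adam(actor.parameters(), lr=0.001)     # actor network learning rate
critic_opt = torch.optim.Adam(critic.parameters(), lr=0.0001)  # critic network learning rate
```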
The experimental data are derived from part of the actual operation data of a building in a certain region in July and August of two years, and the outdoor temperature used in the experiment follows the daily temperature of that building's region in July and August of those two years.
Fig. 2 shows the cumulative reward over the whole DDPG network learning process. When training on the collected data begins, there is no prior knowledge or rule, so the cumulative reward is very small because the indoor comfort requirement is frequently violated and incurs large losses. After a period of trial and error, the deep reinforcement learning becomes effective: the indoor temperature of the controlled area is kept within the required range and the cumulative reward increases gradually. Finally, when the deep reinforcement learning algorithm has learned a strategy that avoids temperature violations and minimizes energy consumption while also learning a good demand response strategy, the learned strategy represents a balance between the comfort requirement, the energy consumption requirement and the demand response strategy. The Q value then stabilizes, indicating that the proposed method successfully learns a strategy that maximizes the cumulative reward.
Because the outdoor temperature of the air conditioner's operating environment has a certain regularity, the cooling load of the central air conditioner is influenced by the outdoor temperature and is closely related to time. Therefore m_in = 1440 min and m_hvac = 1440 min are selected as the lengths of the input time sequences of the above DDPG network, and the agent has four observations.
when the DDPG network learns to lower the indoor set temperature in advance, the room is pre-cooled in the valley period, and the control effect is as shown in fig. 3.
By utilizing the heat storage characteristic of the building, the advance cooling control strategy is adopted in the period of about 13:20-14:00. Under this strategy, the operating time at high load rate is shortened, a certain peak-clipping effect is achieved, and a certain demand response requirement is met. However, a higher demand response effect can be achieved by selecting an appropriate advance cooling period; the load reduction potential of different pre-cooling periods is shown in Fig. 4.
On the premise that the advance cooling strategy has been shown to be feasible, setting different advance cooling times affects the load regulation potential differently: as the advance cooling time increases, the temperature of the controlled area decreases and the load reduction potential of the system increases, but once the advance cooling time exceeds 40 minutes the load reduction potential no longer changes significantly. This is because, after the load reduction strategy is implemented, the rate of change of the building's room temperature is very sensitive owing to the heat storage characteristic; the longer the advance cooling time, the greater the regulation potential of the central air-conditioning system, but after the demand response has lasted for a period of time, the temperature differences between different periods become small, the room temperature has already reached the constraint's lower temperature limit of 24 °C, and the load reduction capability no longer improves much. In general, as the demand response time increases, the regulation potential does not change much, and when the demand response duration is shorter the strategy in this scenario exhibits better demand response characteristics.
The above analysis shows that cooling in advance within the regulation period can well guarantee the comfort requirement of users while reducing the load. It also reflects the thermal energy storage characteristic of building air conditioning participating in demand response, and demonstrates that the central air-conditioning system is a good user-side demand response resource.
Based on the same inventive concept, the invention also provides a control system of the central air-conditioning system, which comprises an ARM-based embedded device on which a program implementing the reinforcement-learning-based central air-conditioning system control method is deployed, so that the embedded device can shut down the air conditioner and regulate its temperature.
The reinforcement-learning-based central air-conditioning control method and control system provided by the invention can reduce the building load during peak hours, without affecting user comfort, through a series of technical means for regulating the building air-conditioning system.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.