CN114234381A - Central air conditioner control method and system based on reinforcement learning - Google Patents


Info

Publication number
CN114234381A
CN114234381A (application CN202111420612.2A)
Authority
CN
China
Prior art keywords
network
central air-conditioning system
current
control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111420612.2A
Other languages
Chinese (zh)
Inventor
郭睿
陈东
叶傲霜
李逸超
徐刚
胥栋
李赟
石珺
林巧月
周思瑜
钱韦辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Shanghai Electric Power Co Ltd
Original Assignee
State Grid Shanghai Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Shanghai Electric Power Co Ltd
Priority to CN202111420612.2A
Publication of CN114234381A
Legal status: Pending

Classifications

    • F MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24 HEATING; RANGES; VENTILATING
    • F24F AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00 Control or safety arrangements
    • F24F11/50 Control or safety arrangements characterised by user interfaces or communication
    • F24F11/54 Control or safety arrangements characterised by user interfaces or communication using one central controller connected to several sub-controllers
    • F24F11/56 Remote control
    • F24F11/58 Remote control using Internet communication
    • F24F11/61 Control or safety arrangements characterised by user interfaces or communication using timers
    • F24F2110/00 Control inputs relating to air properties
    • F24F2110/10 Temperature
    • F24F2140/00 Control inputs relating to system states
    • F24F2140/20 Heat-exchange fluid temperature

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Chemical & Material Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Mechanical Engineering (AREA)
  • General Engineering & Computer Science (AREA)
  • Air Conditioning Control Device (AREA)

Abstract

A central air-conditioning system control method based on reinforcement learning comprises the following steps: S1, designing the state space S of the central air-conditioning system and of the space in which it is located, the control action A for controlling the central air-conditioning system, and the reward function r_t; S2, designing a DDPG network based on the state space S, the control action A and the reward function r_t; and S3, executing the DDPG network to control the central air-conditioning system. The method uses the Deep Deterministic Policy Gradient (DDPG) method to solve for the control action, is unaffected by model parameters, and increases the air-conditioning load regulation capacity on the premise of ensuring user comfort.

Description

Central air conditioner control method and system based on reinforcement learning
Technical Field
The invention relates to the technical field of central air-conditioning system control, in particular to a central air-conditioning control method and a central air-conditioning control system based on reinforcement learning.
Background
As a major component of building energy consumption, the central air-conditioning (HVAC) system has long operating hours, high power and a flexible temperature regulation range, which make it a demand-side resource with great potential. Because the building environment stores heat, regulating the air-conditioning load has an energy-storage character that conventional loads lack, so the central air-conditioning system is one of the most promising targets for implementing demand response to cut load at times of peak power demand. To adapt to continuously changing outdoor weather and varying indoor loads, how to select a suitable controller that reasonably regulates the central air-conditioning system, reducing the building load at peak hours while still meeting the user's comfort requirements, has long been a research focus of building operation optimization.
Currently, the control methods of the central air conditioning system include:
1) Conventional control, including rule-based control (such as start-stop control) and PID control. Conventional approaches determine supervisory-level set points of the central air-conditioning system, such as various temperature/flow-rate set points, using rule-based methods; the rules are typically static, derived from the experience of engineers and facility managers, and require extensive prior knowledge and accurate system model parameters. Thanks to their simple design and low cost, such methods are widely used in practical engineering projects; however, for a typical complex multivariable system with strong nonlinearity, coupling, time-varying behavior and uncertainty, conventional control often fails to achieve the desired operating effect.
2) Model predictive control (MPC), whose basic idea is to obtain an optimal control strategy at each time step by rolling optimization over a future time window. By predicting future indoor disturbances and outdoor weather conditions, building energy efficiency can be significantly improved. However, the actual performance of MPC depends heavily on model accuracy. For building thermal and humidity environment control in particular, it is difficult to establish a building dynamic model that is both accurate and usable for real-time optimal control, and once the mathematical model deviates substantially from reality, the effectiveness of the control strategy computed by MPC cannot be guaranteed. Moreover, MPC requires low-order system dynamics and objective functions; developing an MPC "model" is complex, and linear models are typically used to describe the building temperature response, so the control variables must be chosen carefully to keep a low-order relationship between the central air-conditioning energy consumption and the state and control variables.
3) Heuristic algorithms (such as genetic algorithms and particle swarm optimization). Genetic algorithms have been used to realize energy-saving optimized operation of central air-conditioning systems, but such optimization requires building a black-box model, and the mechanism modeling and parameter identification work is complex.
4) Reinforcement learning methods, which mainly use the traditional tabular Q-learning algorithm to optimize the operation of the air-conditioning system. In practical control problems, however, the dimensions of the system's state and action spaces are large, and the algorithm faces the curse of dimensionality. Thanks to the strong generalization ability of neural networks, the curse of dimensionality can be addressed by parameterized value-function approximation, but a single-neural-network structure is prone to value-function over-estimation during learning. On this basis, an LSTM neural network can mitigate the vanishing-gradient problem and improve the stability of the reinforcement learning algorithm, but the value-function over-estimation is still not remedied.
The problems of the existing central air-conditioning control methods can thus be summarized as modeling that is either difficult or inaccurate. It is therefore necessary to provide a central air-conditioning system control method that avoids the difficulty of accurate modeling.
Disclosure of Invention
The invention provides a central air-conditioning control method based on reinforcement learning, which uses the Deep Deterministic Policy Gradient (DDPG) method to solve for the control action; it is not affected by model parameters and increases the air-conditioning load regulation capacity on the premise of ensuring user comfort.
To achieve the above and other related objects, the present invention provides a central air conditioning system control method based on reinforcement learning, comprising the steps of:
S1, designing the state space S of the central air-conditioning system, the control action A for controlling the central air-conditioning system, and the reward function r_t;
the state space S comprises at least the air-conditioning load, the controlled-area temperature subject to weather and disturbance factors, the outdoor weather conditions, the chilled-water supply temperature, the pre-cooling lead time, the demand-response duration, the chiller operating state and a time series;
the control action A either shuts down the central air-conditioning system or selects a supply-water temperature from a supply-water temperature set as the supply-water temperature of the central air-conditioning system, and the control action A is selected based on the state space S;
the reward function r_t is used to evaluate the control result produced by the control action A to obtain a reward value;
S2, designing a DDPG network based on the state space S, the control action A and the reward function r_t;
and S3, executing the DDPG network to control the central air-conditioning system.
Preferably, the formula of the reward function r_t is:
r_t = -[η×(T_setlow - T_ave)×λ + β×P_hvac]
where η, λ and β denote adjustable hyper-parameters; η and β control the relative importance, for the optimization, of building air-conditioning energy consumption versus indoor thermal comfort; λ denotes the penalty level for controlled-area temperature violations during idle time; T_setlow denotes the penalty threshold of the indoor air temperature; T_ave denotes the indoor average temperature; and all parameters are normalized.
Preferably, the implementation of the DDPG network comprises the following steps:
S3.1, randomly initializing the current critic network Q, the current actor network μ and their target networks Q' and μ', randomly initializing the replay buffer R, and randomly initializing the noise N;
S3.2, setting an initial state s_t based on the state space S, and inputting s_t into the current actor network μ to obtain an initial action a_t;
S3.3, executing the action a_t, receiving the reward R_t according to the reward function r_t, entering the next state s_{t+1}, and storing the tuple [s_t, a_t, R_t, s_{t+1}] in the replay buffer R;
S3.4, performing m random samplings of [s_t, a_t, R_t, s_{t+1}] from the replay buffer R, where t = 1, 2, ..., m and m ≥ 2, and computing the target value y_t with the target network Q' of the current critic network Q based on the m random samples;
S3.5, substituting y_t into the loss function of the current critic network Q to update Q, then updating the current actor network μ by gradient back-propagation;
S3.6, updating the target networks Q' and μ' proportionally.
Preferably, the loss function of the current critic network Q is:
L = (1/m) · Σ_{t=1}^{m} (y_t - Q(s_t, a_t | θ^Q))²
where Q(s_t, a_t | θ^Q) denotes substituting s_t and a_t into the current critic network Q with network parameters θ^Q, t = 1, 2, ..., m, m ≥ 2.
Preferably, the specific formula for updating the current actor network μ by gradient back-propagation is:
∇_{θ^μ} J ≈ (1/m) · Σ_{t=1}^{m} ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t}
where θ^μ denotes the network parameters of the current actor network μ.
Preferably, updating the target networks Q' and μ' proportionally means updating their network parameters θ^{Q'} and θ^{μ'} according to the following formulas:
θ^{Q'} ← τ·θ^Q + (1-τ)·θ^{Q'}
θ^{μ'} ← τ·θ^μ + (1-τ)·θ^{μ'}
where τ denotes the update coefficient.
Preferably, the time series is defined as:
S_m = [S_w, S_d]
where S_m denotes the time series, S_w denotes the weekly time index over the seven days of the week, and S_d denotes the daily time index over the 144 ten-minute intervals of a day.
Based on the same inventive concept, the invention also provides a control system for the central air-conditioning system, comprising an ARM-based embedded device on which a program of the reinforcement-learning-based central air-conditioning system control method described above is deployed, so that the embedded device performs the shutdown and supply-temperature operations of the air conditioner.
In conclusion, the central air-conditioning control method and control system based on reinforcement learning provided by the invention can achieve the purpose of reducing the building load at peak times, on the premise of not affecting user comfort, through a series of technical means of regulating the building air-conditioning system; the method has strong convergence ability and good stability, and improves system efficiency through continuous learning.
Drawings
Fig. 1 is a schematic step diagram of a central air-conditioning control method based on reinforcement learning according to an embodiment of the present invention;
Fig. 2 is a diagram illustrating the accumulated reward in the DDPG network learning process in a central air-conditioning control method based on reinforcement learning according to an embodiment of the present invention;
Fig. 3 is a schematic diagram illustrating the control effect of the DDPG network in the central air-conditioning control method based on reinforcement learning according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the load-shedding potential when the DDPG network selects different pre-cooling durations in the central air-conditioning control method based on reinforcement learning according to an embodiment of the present invention.
Detailed Description
The central air-conditioning control method and system based on reinforcement learning are described in detail below with reference to the accompanying drawings and specific embodiments. The advantages and features of the invention will become clearer from the following description. It should be noted that the drawings are in a very simplified form and use imprecise proportions, serving only to conveniently and clearly assist in describing the embodiments of the invention. The structures, proportions and sizes shown in the drawings are intended only to complement the disclosure of the specification for the understanding of those skilled in the art, and do not limit the conditions under which the invention can be implemented; any structural modification, change of proportion or adjustment of size that does not affect the efficacy or purpose of the invention still falls within the scope of the invention.
First, the DDPG mentioned in the invention is explained. DDPG evolved from DDQN. In DDQN, the current Q network computes the Q values of the actions executable in the current state space S; an action A is then selected with an ε-greedy policy, executing A yields a new state space S' and a reward, and the sample is put into an experience pool, i.e. the replay buffer. For the next state space S' sampled from the replay buffer, the executable actions are computed and an action A' is selected with a greedy policy; the target Q network computes the target Q value, after which the loss function is computed and the parameters are updated by gradient back-propagation. Following the idea of decoupling Q values from actions, the target Q network computes the target Q values of the experience-pool samples in combination with the current Q network, and periodically copies its parameters from the current Q network. In DDPG, the roles of the current and target critic networks are basically similar to those of the current and target Q networks in DDQN. However, DDPG has its own actor policy network, so the ε-greedy policy is not needed: action A is selected by the current actor network. For the next state space S' sampled from the experience pool, action A' is selected by the target actor network, without a greedy policy. DDPG incorporates the concepts of experience replay and double networks (a current network and a target network); since there are both an actor network and a critic network, this gives four networks in total: the current actor network, the target actor network, the current critic network and the target critic network. The two actor networks have the same structure, as do the two critic networks.
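As a concrete illustration of this four-network structure, the following is a minimal PyTorch sketch; the layer sizes, the state dimension (the eight state quantities of S, with the time series contributing a weekly and a daily index) and all identifiers are assumptions for illustration, not taken from the patent:

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network μ: maps a state vector to a continuous action
    (e.g. a normalized supply-water temperature)."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh())  # bounded output in [-1, 1]

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

class Critic(nn.Module):
    """Value network Q: maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1))

# The four networks of DDPG: current/target actor and current/target critic.
state_dim, action_dim = 9, 1          # assumed encoding of the state space S
actor = Actor(state_dim, action_dim)
critic = Critic(state_dim, action_dim)
actor_target = copy.deepcopy(actor)   # target networks start as exact copies
critic_target = copy.deepcopy(critic)
```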
Referring to fig. 1, the present invention provides a central air conditioning system control method based on reinforcement learning, including the following steps:
S1, designing the state space S of the central air-conditioning system, the control action A for controlling the central air-conditioning system, and the reward function r_t. The state space S comprises at least the air-conditioning load, the controlled-area temperature subject to weather and disturbance factors, the outdoor weather conditions, the chilled-water supply temperature, the pre-cooling lead time, the demand-response duration, the chiller operating state and a time series. The control action A either shuts down the central air-conditioning system or selects a supply-water temperature from a supply-water temperature set as the supply-water temperature of the central air-conditioning system; the control action A is selected based on the state space S. The reward function r_t is used to evaluate the control result produced by the control action A to obtain a reward value;
S2, designing a DDPG network based on the state space S, the control action A and the reward function r_t;
and S3, executing the DDPG network to control the central air-conditioning system.
In this embodiment, for step S1, the invention formulates the demand-response problem of the central air-conditioning system as a Markov decision process, determines the observable state space and the control information, and designs a reward function to accelerate the agent's optimization process.
First, the state space is designed, i.e., the state space S of the central air-conditioning system and of the space in which it is located: S = [P_hvac, T_in, T_out, T_supply, T_p, T_e, s_{i,t}, S_m], where P_hvac denotes the air-conditioning load, influenced by the control actions; T_in is the controlled-area temperature, subject to weather and disturbance factors; T_out is the outdoor weather condition; T_supply is the chilled-water supply temperature; T_p is the pre-cooling lead time; T_e is the demand-response duration; s_{i,t} is the chiller operating state; and the time series S_m can be defined as S_m = [S_w, S_d], where S_w denotes the weekly time index over the seven days of the week and S_d denotes the daily time index over the 144 ten-minute intervals of a day.
Then, the control actions are designed, denoted A: A = [off, a_1, a_2, ..., a_n], where off is the shutdown state and a_i (i = 1, 2, ..., n) denotes the different values the building's supply-water temperature takes at different times.
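To make the state and action design concrete, here is a minimal sketch of one possible encoding; the helper name, the argument list and the particular supply-water temperature values are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def build_state(p_hvac, t_in, t_out, t_supply, t_p, t_e, chiller_on,
                minute_of_day, day_of_week):
    """Assemble S = [P_hvac, T_in, T_out, T_supply, T_p, T_e, s_it, S_m].
    The time series S_m is encoded as the weekly index (1..7) and the
    10-minute slot of the day (1..144), following the patent's definition."""
    s_w = day_of_week               # 1..7
    s_d = minute_of_day // 10 + 1   # 1..144
    return np.array([p_hvac, t_in, t_out, t_supply, t_p, t_e,
                     float(chiller_on), s_w, s_d], dtype=np.float32)

# Action set A = [off, a_1, ..., a_n]: shutdown, or one of n candidate
# supply-water temperatures (example values in degrees Celsius).
ACTIONS = ["off"] + [7.0 + 0.5 * i for i in range(8)]
```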
Finally, the reward function is designed, denoted r_t: r_t = -[η×(T_setlow - T_ave)×λ + β×P_hvac], where η, λ and β are adjustable hyper-parameters; η and β control the relative importance, for the optimization, of building air-conditioning energy consumption versus indoor thermal comfort; λ is the penalty level for controlled-area temperature violations during idle time; T_setlow is the penalty threshold of the indoor air temperature; T_ave is the indoor average temperature; and all parameters are normalized.
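Written as code, this reward is a one-line function; a minimal sketch, assuming all inputs are already normalized as the text requires:

```python
def reward(t_ave, p_hvac, t_setlow, eta, lam, beta):
    """r_t = -[eta * (T_setlow - T_ave) * lam + beta * P_hvac]:
    eta and beta weight comfort against energy consumption, lam penalizes
    controlled-area temperature violations during idle time."""
    return -(eta * (t_setlow - t_ave) * lam + beta * p_hvac)
```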
In this embodiment, the DDPG network is trained according to the following steps:
S3.1, randomly initializing the current critic network Q, the current actor network μ and their target networks Q' and μ', and randomly initializing the replay buffer R;
S3.2, setting an initial state s_t based on the state space S, and inputting s_t into the current actor network μ to obtain an initial action a_t:
a_t = μ(s_t | θ^μ) + N
where N denotes random noise: to add some randomness to the learning process and increase its coverage, DDPG adds a certain noise N to the selected action.
S3.3, executing the action a_t, receiving the reward R_t according to the reward function r_t, entering the next state s_{t+1}, and storing the tuple [s_t, a_t, R_t, s_{t+1}] in the replay buffer R;
S3.4, performing m random samplings of [s_t, a_t, R_t, s_{t+1}] from the replay buffer R, where t = 1, 2, ..., m and m ≥ 2, and computing the target value y_t with the target network Q' of the current critic network Q based on the m random samples;
S3.5, substituting y_t into the loss function of the current critic network Q to update Q, then updating the current actor network μ by gradient back-propagation;
S3.6, updating the target networks Q' and μ' proportionally.
The basic idea is to use convolutional neural networks, namely the μ network and the Q network, to approximate the policy function, and then to train them with deep learning. The policy is deterministic: at each step the action is obtained directly as a definite value from the policy function, while deep learning continuously optimizes the networks and thereby improves the policy.
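Steps S3.1 to S3.6 together make up one training iteration. Below is a minimal sketch of such an iteration, reusing the actor/critic networks from the earlier sketch; the replay-buffer format, the optimizers, the batch size and the target value y_t = R_t + γ·Q'(s_{t+1}, μ'(s_{t+1})) follow the standard DDPG formulation and are assumptions rather than text quoted from the patent:

```python
import random
import torch
import torch.nn.functional as F

optim_critic = torch.optim.Adam(critic.parameters(), lr=1e-4)  # critic learning rate
optim_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)    # actor learning rate
gamma, tau = 0.99, 0.001  # discount factor and soft-update coefficient

def train_step(replay_buffer, m=64):
    """One DDPG iteration over m random samples [s_t, a_t, R_t, s_{t+1}];
    the buffer is assumed to hold tuples of torch tensors."""
    batch = random.sample(replay_buffer, m)
    s, a, r, s_next = (torch.stack(x) for x in zip(*batch))
    r = r.view(-1, 1)

    # Target value y_t from the target networks (S3.4).
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Critic update: minimize (1/m) * sum_t (y_t - Q(s_t, a_t | theta_Q))^2 (S3.5).
    critic_loss = F.mse_loss(critic(s, a), y)
    optim_critic.zero_grad(); critic_loss.backward(); optim_critic.step()

    # Actor update by gradient back-propagation through Q (S3.5).
    actor_loss = -critic(s, actor(s)).mean()
    optim_actor.zero_grad(); actor_loss.backward(); optim_actor.step()

    # Proportional (soft) update of the target networks (S3.6):
    # theta' <- tau * theta + (1 - tau) * theta'
    for net, target in ((critic, critic_target), (actor, actor_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```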
In this embodiment, the loss function of the current critic network Q is:
L = (1/m) · Σ_{t=1}^{m} (y_t - Q(s_t, a_t | θ^Q))²
where Q(s_t, a_t | θ^Q) denotes substituting s_t and a_t into the current critic network Q with network parameters θ^Q, t = 1, 2, ..., m, m ≥ 2.
In this embodiment, the specific formula for updating the current actor network μ by gradient back-propagation is:
∇_{θ^μ} J ≈ (1/m) · Σ_{t=1}^{m} ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t}
where θ^μ denotes the network parameters of the current actor network μ.
In this embodiment, updating the target networks Q' and μ' proportionally means updating their network parameters θ^{Q'} and θ^{μ'} according to the following formulas:
θ^{Q'} ← τ·θ^Q + (1-τ)·θ^{Q'}
θ^{μ'} ← τ·θ^μ + (1-τ)·θ^{μ'}
where τ denotes the update coefficient, generally taken to be relatively small, such as 0.1 or 0.01.
In addition, the inventors tested the method. Over the whole continuous demand-response process, pre-cooling is performed a certain time in advance according to the learned policy, with the start time, the duration and the pre-cooling temperature learned by the agent. The overall demand-response duration is learned independently by the agent according to its sensitivity to changes in the outdoor temperature difference, and after the demand response ends the unit operates normally according to the outdoor temperature. For the demand-response verification of the central air-conditioning system controlled by reinforcement learning, the parameters were configured with the following example system settings: actor network learning rate 0.001, critic network learning rate 0.0001, discount factor 0.99 and target-network update parameter 0.001.
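For reference, these example settings map directly onto the hyper-parameters of the training sketch above; the dictionary below merely restates them, with key names that are the sketch's own, not the patent's:

```python
# Example system parameter settings from the text:
CONFIG = {
    "actor_lr": 1e-3,   # actor network learning rate 0.001
    "critic_lr": 1e-4,  # critic network learning rate 0.0001
    "gamma": 0.99,      # discount factor
    "tau": 0.001,       # target-network update parameter
}
```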
The experimental data come from part of the actual operating data of a building in a certain region during July and August of two years, and the outdoor temperature set in the experiment varies according to the daily temperatures at that building during those two months.
Fig. 2 shows the accumulated reward over the whole DDPG network learning process. At the start of training on the collected data, with no prior knowledge or rules, the accumulated reward is very small because frequently violating the indoor comfort requirement incurs large losses. After a period of trial and error, the deep reinforcement learning becomes effective: the indoor temperature of the controlled area is kept within the required range and the accumulated reward gradually increases. Finally, the deep reinforcement learning algorithm learns a strategy that avoids temperature violations and minimizes energy consumption while also learning a good demand-response strategy, i.e., a balance between the comfort requirement, the energy-consumption requirement and the demand-response strategy. The Q value then stabilizes, indicating that the proposed method has successfully learned a strategy that maximizes the accumulated reward.
Because the outdoor temperature of the air conditioner's operating environment shows a certain regularity, the cooling load of the central air conditioner is influenced by the outdoor temperature and closely related to time. Therefore m_in = 1440 min and m_hvac = 1440 min are selected as the lengths of the input time sequences of the above DDPG network, and the agent has four observations (defined by equations given only as images in the original).
when the DDPG network learns to lower the indoor set temperature in advance, the room is pre-cooled in the valley period, and the control effect is as shown in fig. 3.
By utilizing the heat-storage characteristic of the building, the pre-cooling control strategy is applied during roughly 13:20-14:00 (namely minutes 800-840 of the day). Under this strategy the running time at high load rate is shortened, which achieves a certain peak-clipping effect and meets a certain demand-response requirement. A still better demand-response effect can, however, be obtained by choosing an appropriate pre-cooling period; the load-reduction potential for different pre-cooling durations is shown in fig. 4.
On the premise that the pre-cooling strategy is feasible, exploring different pre-cooling durations affects the regulation potential of the load differently: as the pre-cooling lead time increases, the temperature of the controlled area decreases and the load-reduction potential of the system strengthens, but once the pre-cooling time exceeds 40 minutes the load-reduction potential no longer changes much. This is because, after the load-reduction strategy is implemented, the heat-storage characteristics make the building's room-temperature change rate very sensitive; the longer the pre-cooling lead time, the greater the regulation potential of the central air-conditioning system, but after the demand response has lasted for a while the temperature differences between the periods become small, the room temperature has already reached the lower temperature limit of 24 °C imposed as a constraint, and the load-reduction capability no longer improves much. In general, as the demand-response time increases the regulation potential does not change much, and when the demand-response duration is short the strategy in this scenario shows better demand-response characteristics.
The above analysis shows that pre-cooling ahead of the regulation period can well guarantee the user's comfort requirement while reducing the load. It also reflects the thermal-energy-storage characteristic of building air conditioning participating in demand response and demonstrates that the central air-conditioning system is a good user-side demand-response resource.
Based on the same inventive concept, the invention also provides a control system for the central air-conditioning system, comprising an ARM-based embedded device on which a program of the reinforcement-learning-based central air-conditioning system control method is deployed, so that the embedded device performs the shutdown and supply-temperature operations of the air conditioner.
The central air-conditioning control method and control system based on reinforcement learning provided by the invention can achieve the purpose of reducing the building load during peak periods through a series of technical means of regulating the building air-conditioning system, without affecting user comfort.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (8)

1. A central air-conditioning system control method based on reinforcement learning is characterized by comprising the following steps:
S1, designing the state space S of the central air-conditioning system, the control action A for controlling the central air-conditioning system, and the reward function r_t;
the state space S comprising at least the air-conditioning load, the controlled-area temperature subject to weather and disturbance factors, the outdoor weather conditions, the chilled-water supply temperature, the pre-cooling lead time, the demand-response duration, the chiller operating state and a time series;
the control action A either shutting down the central air-conditioning system or selecting a supply-water temperature from a supply-water temperature set as the supply-water temperature of the central air-conditioning system, the control action A being selected based on the state space S;
the reward function r_t being used to evaluate the control result produced by the control action A to obtain a reward value;
S2, designing a DDPG network based on the state space S, the control action A and the reward function r_t;
and S3, executing the DDPG network to control the central air-conditioning system.
2. The reinforcement-learning-based central air-conditioning system control method of claim 1, wherein the formula of the reward function r_t is:
r_t = -[η×(T_setlow - T_ave)×λ + β×P_hvac]
where η, λ and β denote adjustable hyper-parameters; η and β control the relative importance, for the optimization, of building air-conditioning energy consumption versus indoor thermal comfort; λ denotes the penalty level for controlled-area temperature violations during idle time; T_setlow denotes the penalty threshold of the indoor air temperature; T_ave denotes the indoor average temperature; and all parameters are normalized.
3. The reinforcement-learning-based central air-conditioning system control method of claim 1, wherein the implementation of the DDPG network comprises the following steps:
S3.1, randomly initializing the current critic network Q, the current actor network μ and their target networks Q' and μ', randomly initializing the replay buffer R, and randomly initializing the noise N;
S3.2, setting an initial state s_t based on the state space S, and inputting s_t into the current actor network μ to obtain an initial action a_t;
S3.3, executing the action a_t, receiving the reward R_t according to the reward function r_t, entering the next state s_{t+1}, and storing the tuple [s_t, a_t, R_t, s_{t+1}] in the replay buffer R;
S3.4, performing m random samplings of [s_t, a_t, R_t, s_{t+1}] from the replay buffer R, where t = 1, 2, ..., m and m ≥ 2, and computing the target value y_t with the target network Q' of the current critic network Q based on the m random samples;
S3.5, substituting y_t into the loss function of the current critic network Q to update Q, then updating the current actor network μ by gradient back-propagation;
S3.6, updating the target networks Q' and μ' proportionally.
4. The reinforcement-learning-based central air-conditioning system control method of claim 3, wherein the loss function of the current critic network Q is:
L = (1/m) · Σ_{t=1}^{m} (y_t - Q(s_t, a_t | θ^Q))²
where Q(s_t, a_t | θ^Q) denotes substituting s_t and a_t into the current critic network Q with network parameters θ^Q, t = 1, 2, ..., m, m ≥ 2.
5. The reinforcement-learning-based central air-conditioning system control method of claim 4, wherein the specific formula for updating the current actor network μ by gradient back-propagation is:
∇_{θ^μ} J ≈ (1/m) · Σ_{t=1}^{m} ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t}
where θ^μ denotes the network parameters of the current actor network μ.
6. The reinforcement-learning-based central air-conditioning system control method of claim 5, wherein updating the target networks Q' and μ' proportionally means updating their network parameters θ^{Q'} and θ^{μ'} according to the following formulas:
θ^{Q'} ← τ·θ^Q + (1-τ)·θ^{Q'}
θ^{μ'} ← τ·θ^μ + (1-τ)·θ^{μ'}
where τ denotes the update coefficient.
7. The reinforcement-learning-based central air-conditioning system control method of claim 1, wherein the time series is defined as:
S_m = [S_w, S_d]
where S_m denotes the time series, S_w denotes the weekly time index over the seven days of the week, and S_d denotes the daily time index over the 144 ten-minute intervals of a day.
8. A control apparatus of a central air-conditioning system, characterized in that the apparatus is deployed with a program of the reinforcement-learning-based central air-conditioning system control method according to any one of claims 1 to 7, so that the apparatus performs the shutdown and supply-temperature operations of the air conditioner.
CN202111420612.2A 2021-11-26 2021-11-26 Central air conditioner control method and system based on reinforcement learning Pending CN114234381A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111420612.2A CN114234381A (en) 2021-11-26 2021-11-26 Central air conditioner control method and system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111420612.2A CN114234381A (en) 2021-11-26 2021-11-26 Central air conditioner control method and system based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN114234381A true CN114234381A (en) 2022-03-25

Family

ID=80751285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111420612.2A Pending CN114234381A (en) 2021-11-26 2021-11-26 Central air conditioner control method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN114234381A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115289619A (en) * 2022-07-28 2022-11-04 安徽大学 Subway platform HVAC control method based on multi-agent deep reinforcement learning
CN116481149A (en) * 2023-06-20 2023-07-25 深圳市微筑科技有限公司 Method and system for configuring indoor environment parameters
CN118224713A (en) * 2024-05-07 2024-06-21 深圳市亚晔实业有限公司 Air ventilation and exhaust cooperative control method and device based on multi-agent system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040204784A1 (en) * 2002-12-16 2004-10-14 Maturana Francisco Paul Decentralized autonomous control for complex fluid distribution systems
CN101650063A (en) * 2008-08-11 2010-02-17 柯细勇 Climate compensation controller for central air conditioner and climate compensation method for central air conditioner
CN108386971A (en) * 2018-01-28 2018-08-10 浙江博超节能科技有限公司 Central air-conditioning energy robot control system(RCS)
CN109890176A (en) * 2019-03-01 2019-06-14 北京慧辰资道资讯股份有限公司 A kind of method and device based on artificial intelligence optimization's energy consumption of machine room efficiency
CN110398029A (en) * 2019-07-25 2019-11-01 北京上格云技术有限公司 Control method and computer readable storage medium
CN111623497A (en) * 2020-02-20 2020-09-04 上海朗绿建筑科技股份有限公司 Radiation air conditioner precooling and preheating method and system, storage medium and radiation air conditioner
CN112966431A (en) * 2021-02-04 2021-06-15 西安交通大学 Data center energy consumption joint optimization method, system, medium and equipment
CN113283156A (en) * 2021-03-29 2021-08-20 北京建筑大学 Subway station air conditioning system energy-saving control method based on deep reinforcement learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040204784A1 (en) * 2002-12-16 2004-10-14 Maturana Francisco Paul Decentralized autonomous control for complex fluid distribution systems
CN101650063A (en) * 2008-08-11 2010-02-17 柯细勇 Climate compensation controller for central air conditioner and climate compensation method for central air conditioner
CN108386971A (en) * 2018-01-28 2018-08-10 浙江博超节能科技有限公司 Central air-conditioning energy robot control system(RCS)
CN109890176A (en) * 2019-03-01 2019-06-14 北京慧辰资道资讯股份有限公司 A kind of method and device based on artificial intelligence optimization's energy consumption of machine room efficiency
CN110398029A (en) * 2019-07-25 2019-11-01 北京上格云技术有限公司 Control method and computer readable storage medium
CN111623497A (en) * 2020-02-20 2020-09-04 上海朗绿建筑科技股份有限公司 Radiation air conditioner precooling and preheating method and system, storage medium and radiation air conditioner
CN112966431A (en) * 2021-02-04 2021-06-15 西安交通大学 Data center energy consumption joint optimization method, system, medium and equipment
CN113283156A (en) * 2021-03-29 2021-08-20 北京建筑大学 Subway station air conditioning system energy-saving control method based on deep reinforcement learning

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115289619A (en) * 2022-07-28 2022-11-04 安徽大学 Subway platform HVAC control method based on multi-agent deep reinforcement learning
CN116481149A (en) * 2023-06-20 2023-07-25 深圳市微筑科技有限公司 Method and system for configuring indoor environment parameters
CN116481149B (en) * 2023-06-20 2023-09-01 深圳市微筑科技有限公司 Method and system for configuring indoor environment parameters
CN118224713A (en) * 2024-05-07 2024-06-21 深圳市亚晔实业有限公司 Air ventilation and exhaust cooperative control method and device based on multi-agent system

Similar Documents

Publication Publication Date Title
CN114234381A (en) Central air conditioner control method and system based on reinforcement learning
Zou et al. Towards optimal control of air handling units using deep reinforcement learning and recurrent neural network
Nagarathinam et al. Marco-multi-agent reinforcement learning based control of building hvac systems
CN111795484B (en) Intelligent air conditioner control method and system
Huang et al. Simulation-based performance evaluation of model predictive control for building energy systems
Privara et al. Model predictive control of a building heating system: The first experience
CN113283156B (en) Energy-saving control method for subway station air conditioning system based on deep reinforcement learning
Li et al. A multi-grid reinforcement learning method for energy conservation and comfort of HVAC in buildings
US20200379417A1 (en) Techniques for using machine learning for control and predictive maintenance of buildings
TWI504094B (en) Power load monitoring and predicting system and method thereof
CN111351180A (en) System and method for realizing energy conservation and temperature control of data center by applying artificial intelligence
WO2015151363A1 (en) Air-conditioning system and control method for air-conditioning equipment
US20130151013A1 (en) Method for Controlling HVAC Systems Using Set-Point Trajectories
Ghahramani et al. Energy trade off analysis of optimized daily temperature setpoints
Heidari et al. Reinforcement Learning for proactive operation of residential energy systems by learning stochastic occupant behavior and fluctuating solar energy: Balancing comfort, hygiene and energy use
Nikovski et al. A method for computing optimal set-point schedules for HVAC systems
CN114110824B (en) Intelligent control method and device for constant humidity machine
CN1920427B (en) Room temperature PID control method of air conditioner set
CN113821903A (en) Temperature control method and device, modular data center and storage medium
CN116907036A (en) Deep reinforcement learning water chilling unit control method based on cold load prediction
Sun et al. Intelligent distributed temperature and humidity control mechanism for uniformity and precision in the indoor environment
Bayer et al. Enhancing the performance of multi-agent reinforcement learning for controlling HVAC systems
Wang et al. A Comparison of Classical and Deep Reinforcement Learning Methods for HVAC Control
Schepers et al. Autonomous building control using offline reinforcement learning
Shen et al. Advanced control framework of regenerative electric heating with renewable energy based on multi-agent cooperation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220325)