CN114020079B - Indoor space temperature and humidity regulation and control method and device - Google Patents

Indoor space temperature and humidity regulation and control method and device

Info

Publication number
CN114020079B
Authority
CN
China
Prior art keywords
humidity
temperature
value
deviation
time step
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111293890.6A
Other languages
Chinese (zh)
Other versions
CN114020079A (en)
Inventor
张勇
郭达
孙蕴琪
罗丹峰
袁思雨
张晨曦
张修勇
吴来明
徐方圆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Pengtong Gaoke Science & Technology Co ltd
SHANGHAI MUSEUM
Beijing University of Posts and Telecommunications
Original Assignee
Beijing Pengtong Gaoke Science & Technology Co ltd
SHANGHAI MUSEUM
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Pengtong Gaoke Science & Technology Co ltd, SHANGHAI MUSEUM, Beijing University of Posts and Telecommunications filed Critical Beijing Pengtong Gaoke Science & Technology Co ltd
Priority to CN202111293890.6A priority Critical patent/CN114020079B/en
Publication of CN114020079A publication Critical patent/CN114020079A/en
Application granted granted Critical
Publication of CN114020079B publication Critical patent/CN114020079B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D27/00 Simultaneous control of variables covered by two or more of main groups G05D1/00 - G05D25/00
    • G05D27/02 Simultaneous control of variables covered by two or more of main groups G05D1/00 - G05D25/00 characterised by the use of electric means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02B CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO BUILDINGS, e.g. HOUSING, HOUSE APPLIANCES OR RELATED END-USER APPLICATIONS
    • Y02B30/00 Energy efficient heating, ventilation or air conditioning [HVAC]
    • Y02B30/70 Efficient control or regulation technologies, e.g. for control of refrigerant flow, motor or heating

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Air Conditioning Control Device (AREA)

Abstract

The invention provides a method and a device for regulating and controlling the temperature and humidity of an indoor space. In the reinforcement learning process, the humidity accuracy deviation, the humidity uniformity deviation, the temperature accuracy deviation and the temperature uniformity deviation are introduced to calculate an observation reward value, so that both the control accuracy of the temperature and humidity and the uniformity across positions in the set space are taken into account; the method can therefore achieve accurate and uniform control of the temperature and humidity in the set space.

Description

Indoor space temperature and humidity regulation and control method and device
Technical Field
The invention relates to the technical field of electronic equipment control, in particular to a method and a device for regulating and controlling the temperature and humidity of an indoor space.
Background
The preservation environment of cultural relics in a museum is closely related to their service life: the temperature and the humidity of the environment are two important factors affecting preservation, and a stable and suitable temperature and humidity environment is one of the important conditions for reducing the risk of degradation and prolonging the life of cultural relics. Cultural relics of any material have an optimum range of temperature and humidity conditions, and they are easily damaged once the conditions exceed that range. Accurate regulation of the temperature and humidity in a museum is therefore critical. However, traditional control methods use a single monitoring point, often ignore the uniformity of the indoor temperature and humidity, and suffer from a certain time delay when responding to disturbances; existing research work has not effectively solved these problems.
At present, many methods for controlling the temperature and humidity in museums have been proposed, which restrict the fluctuation range of temperature and humidity by controlling the wind direction of an air conditioner or by controlling Indirect Direct Evaporative Cooling (IDEC) and Ultrasonic Atomization Humidification (UAH) devices. However, most of the prior art only addresses accurate control of the temperature and humidity in small spaces such as display cases; for an indoor space such as a museum exhibition hall, the control accuracy and uniformity of the temperature and humidity are insufficient, so the prior art is not suitable for application scenarios with high temperature and humidity requirements.
Disclosure of Invention
The embodiment of the invention provides a method and a device for regulating and controlling the temperature and the humidity of an indoor space, which are used for eliminating or improving one or more defects in the prior art and solving the problem that the prior art cannot accurately control the temperature and the humidity in the indoor space.
The technical scheme of the invention is as follows:
in one aspect, the invention provides a method for regulating and controlling the temperature and humidity of an indoor space, which runs on a controller, wherein the controller is connected with a plurality of sensors and a plurality of actuators in a set space through the Internet of Things, the sensors comprise humidity sensors and temperature sensors, and the actuators are air outlets of a constant temperature and humidity machine; the method comprises the following steps:
acquiring the humidity values and temperature values collected by each humidity sensor and each temperature sensor at a specified interval as state parameters to form a state space, and taking the wind speed gears at which each air outlet operates under the set humidity and set temperature as action parameters to form an action space;
outputting corresponding action parameters according to the state parameters of each time step by adopting a preset deep reinforcement learning model and controlling the actuators to execute them; acquiring the actual humidity value of each humidity sensor and the actual temperature value of each temperature sensor at each time step, calculating the humidity accuracy deviation between the actual humidity value of each humidity sensor and the set humidity and the humidity uniformity deviation among the actual humidity values, calculating the temperature accuracy deviation between the actual temperature value of each temperature sensor and the set temperature and the temperature uniformity deviation among the actual temperature values, and calculating an observation reward value according to the humidity accuracy deviation, the humidity uniformity deviation, the temperature accuracy deviation and the temperature uniformity deviation corresponding to each time step; storing the state parameters, the action parameters and the observation reward values corresponding to each time step in an experience pool as experience data;
in the deep reinforcement learning process, the preset deep reinforcement learning model samples the experience data in the experience pool according to priority and uses a neural network to fit the value Q of each action in the current state. The neural network model is provided with a local network for calculating the predicted Q value of each action at the current time step and selecting action parameters according to a set strategy, and with a target network for calculating the target Q value of each action at the next time step; the local network and the target network have the same structure, and the parameters of the local network are updated to the target network at set time intervals. The neural network decomposes the Q value of an action parameter into a state value part related only to the state and an advantage function part related to both the state and the action. A loss function is constructed based on the predicted Q value, the target Q value and the observed reward values of a plurality of future time steps, and learning proceeds until convergence with maximization of the sum of reward values as the optimization objective.
In some embodiments, in calculating the humidity accuracy deviation between the actual humidity value of each humidity sensor and the set humidity and the humidity uniformity deviation among the actual humidity values, the humidity accuracy deviation is calculated as:

H_{sc} = -\frac{1}{k_1} \sum_{i=1}^{k_1} \left| H_t^i - H_{set} \right|

where H_sc denotes the humidity accuracy deviation, H_t^i denotes the humidity value detected by the i-th humidity sensor at time step t, H_set denotes the set humidity, and k_1 is the number of humidity sensors;

the humidity uniformity deviation is calculated as:

H_{unif} = -\frac{1}{k_1} \sum_{i=1}^{k_1} \left| H_t^i - \bar{H}_t \right|, \qquad \bar{H}_t = \frac{1}{k_1} \sum_{i=1}^{k_1} H_t^i

where H_unif is the humidity uniformity deviation, H_t^i is the humidity value detected by the i-th humidity sensor at time step t, \bar{H}_t is the mean of the humidity values detected by the humidity sensors at time step t, and k_1 is the number of humidity sensors;
in calculating the temperature accuracy deviation between the actual temperature value of each temperature sensor and the set temperature and the temperature uniformity deviation between the actual temperature values, the calculation formula of the temperature accuracy deviation is as follows:
Figure GDA0003747166710000031
wherein, T sc Representing said temperature accuracy deviation, T t i Indicating the temperature value, T, detected by the ith temperature sensor at time step T set Represents the set temperature, k 2 The number of the temperature sensors;
the calculation formula of the temperature uniformity deviation is as follows:
Figure GDA0003747166710000033
Figure GDA0003747166710000034
wherein, T unif Represents the temperature uniformity deviation, T t i Represents the temperature value detected by the ith temperature sensor at the time step t,
Figure GDA0003747166710000035
representing the average value, k, of the temperature values detected by the temperature sensors at time t 2 Is the number of said temperature sensors.
In some embodiments, calculating an observed reward value based on the humidity accuracy deviation, the humidity uniformity deviation, the temperature accuracy deviation, and the temperature uniformity deviation for each time step comprises:
R_t = \alpha_1 (T_{sc} + H_{sc}) + \alpha_2 (T_{unif} + H_{unif});

where R_t is the observation reward value, and \alpha_1 and \alpha_2 are weight coefficients.
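For illustration, the following is a minimal Python sketch of this reward computation, assuming the mean-absolute-deviation form of the four deviation terms reconstructed above; the function name, default weights and example readings are illustrative, not taken from the patent.

```python
import numpy as np

def observation_reward(hum_values, temp_values, h_set, t_set, alpha1=1.0, alpha2=1.0):
    """Observation reward R_t for one time step of multi-sensor readings.

    hum_values, temp_values: humidity / temperature readings, one entry per sensor.
    alpha1, alpha2: weight coefficients trading accuracy against uniformity (assumed values).
    """
    h = np.asarray(hum_values, dtype=float)
    t = np.asarray(temp_values, dtype=float)

    # Accuracy deviations: negative mean absolute distance from the set points,
    # so a larger (less negative) value means better accuracy.
    h_sc = -np.mean(np.abs(h - h_set))
    t_sc = -np.mean(np.abs(t - t_set))

    # Uniformity deviations: negative mean absolute distance from the spatial mean.
    h_unif = -np.mean(np.abs(h - h.mean()))
    t_unif = -np.mean(np.abs(t - t.mean()))

    return alpha1 * (t_sc + h_sc) + alpha2 * (t_unif + h_unif)

# Example: three sensors, set point 25 degrees C / 50 % relative humidity.
print(observation_reward([49.0, 51.0, 50.5], [24.8, 25.3, 25.1], h_set=50.0, t_set=25.0))
```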
In some embodiments, the sampling of the experience data in the experience pool by the preset deep reinforcement learning model according to the priority includes:
obtaining the error TD-error of the state parameters of each time step, the priority p_ξ of each piece of experience data being proportional to the TD-error:

p_\xi \propto \left| \delta_\xi \right|;

\delta_\xi = R_{t+1} + \gamma \, Q_{\theta^-}\!\left(S_{t+1}, \arg\max_{a'} Q_\theta(S_{t+1}, a')\right) - Q_\theta(S_t, A_t);

where δ_ξ is the value of the TD-error, p_ξ is the priority, R_{t+1} is the observation reward value at time step t+1, γ is the discount factor, S_{t+1} is the state parameter at time step t+1, S_t is the state parameter at time step t, Q_θ(S_t, A_t) is the predicted Q value output by the local network for state S_t and the executed action A_t, a' is the action corresponding to the maximum predicted Q value selected by the local network, and Q_{θ^-}(S_{t+1}, a') is the target Q value output by the target network for state S_{t+1} under the selected action a'.
In some embodiments, constructing a loss function based on the predicted Q value, the target Q value, and observed reward values for a plurality of time steps in the future comprises:
defining the sum of returns of n future time steps, R_t^{(n)}, as:

R_t^{(n)} = \sum_{x=0}^{n-1} \gamma^{(x)} R_{t+x+1}

where γ^{(x)} is the discount factor of the x-th future time step and R_{t+x+1} is the observation reward value of the x-th future time step;

setting the loss function L as:

L = \left( R_t^{(n)} + \gamma^{(n)} Q_{\theta^-}\!\left(S_{t+n}, \arg\max_{a'} Q_\theta(S_{t+n}, a')\right) - Q_\theta(S_t, A_t) \right)^2

where R_t^{(n)} is the sum of the observation reward values of the next n steps, γ^{(n)} is the n-step discount factor, S_{t+n} is the state parameter at time step t+n, S_t is the state parameter at time step t, Q_θ(S_t, A_t) is the predicted Q value output by the local network for state S_t and the executed action A_t, a' is the action corresponding to the maximum predicted Q value selected by the local network, and Q_{θ^-}(S_{t+n}, a') is the target Q value output by the target network for state S_{t+n} under the selected action a'.
In some embodiments, the discount factor is 0.9-0.95.
In some embodiments, the setting policy is an epsilon-greedy policy, selecting actions randomly with a probability of epsilon, and selecting actions by the neural network with a probability of 1-epsilon.
In some embodiments, the learning rate of the predetermined deep reinforcement learning model is 0.00005 to 0.0001.
In another aspect, the present invention also provides an indoor space temperature and humidity regulating system, comprising:
a plurality of sensors including a humidity sensor and a temperature sensor, the sensors being disposed in the set space;
the constant temperature and humidity machine is provided with a plurality of air outlets, each air outlet is provided with a plurality of air speed gears, the air speed gears of the air outlets are independently arranged, and the air outlets are arranged in a set space;
and the controller is connected with each sensor and used for acquiring a humidity value and a temperature value, and is also connected with the constant temperature and humidity machine and used for controlling the temperature and the humidity in the set space according to the indoor space temperature and humidity regulation and control method.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
The invention has the beneficial effects that:
in the method and the device for regulating and controlling the temperature and humidity of an indoor space, the humidity values and temperature values in the set space are detected by sensors arranged in a distributed manner and used as the state space, and the action corresponding to the state of each time step is selected by deep reinforcement learning. In the reinforcement learning process, the humidity accuracy deviation, the humidity uniformity deviation, the temperature accuracy deviation and the temperature uniformity deviation are introduced to calculate an observation reward value, so that both the control accuracy of the temperature and humidity and the uniformity across positions in the set space are taken into account; the method can therefore achieve accurate and uniform control of the temperature and humidity in the set space.
Further, the method adopts a neural network to fit the Q value of the selected action parameter so as to adapt to the continuous state space of the indoor temperature and humidity regulation scenario. By sampling and learning the experience data in the experience pool according to priority, the model pays more attention to the experience data whose state value estimates have larger errors. Action selection and value estimation are separated by setting a local network and a target network, so that value over-estimation is avoided. By decomposing the estimated reward value of an action parameter into a state value and an advantage function, convergence is faster. Selecting actions with an ε-greedy strategy effectively improves the exploration capability.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principle of the invention. In the drawings:
fig. 1 is a logic diagram illustrating a deep reinforcement learning process in a method for regulating and controlling temperature and humidity of an indoor space according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of the method for regulating and controlling the temperature and humidity of the indoor space according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a neural network used in the method for controlling the temperature and humidity of the indoor space according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a library structure of a museum according to an embodiment of the present invention;
FIG. 5 is a comparison graph of the convergence of the three algorithms nD3QN-PER, DDQN-PER and D3QN;
FIG. 6 is a graph of a comparison of the performance of the nD3QN-PER algorithm at different learning rates;
FIG. 7 is a graph of the performance of the nD3QN-PER algorithm for different discount factors;
FIG. 8 is a graph of the performance of nD3QN-PER algorithm for different hidden layer sizes;
FIG. 9 is a graph comparing average humidity accuracy, average temperature accuracy, average humidity uniformity and average temperature uniformity for the rule-based-1 strategy and the nD3QN-PER strategy under disturbance conditions;
FIG. 10 is a graph comparing average humidity accuracy, average temperature accuracy, average humidity uniformity and average temperature uniformity of a rule-based-2 strategy and an nD3QN-PER strategy under random disturbance conditions;
FIG. 11 is a graph comparing control durations of 3D3QN-PER, DDQN-PER, D3QN, rule-based-1 and rule-based-2 under an interference condition;
FIG. 12 is a graph comparing the average power consumption of 3D3QN-PER, DDQN-PER, D3QN, rule-based-1 and rule-based-2 under interference conditions;
FIG. 13 is a graph of the temperature change within an observation chamber during a first time period under 3D3QN-PER control conditions;
FIG. 14 is a graph of temperature change in an observation chamber during a first time period under rule-based-1 control conditions;
FIG. 15 is a graph of temperature change within an observation chamber during a first time period under rule-based-2 control conditions;
FIG. 16 is a graph of humidity changes within an observation chamber during a first time period under 3D3QN-PER control conditions;
FIG. 17 is a graph of humidity changes in an observation chamber during a first time period under rule-based-1 control conditions;
FIG. 18 is a graph of humidity changes within an observation chamber during a first time period under rule-based-2 control conditions;
FIG. 19 is a graph of nD3QN-PER algorithm regulation indoor temperature and humidity changes under different month weather conditions;
FIG. 20 is a graph comparing the average indoor temperature and humidity under the control of 3D3QN-PER, DDQN-PER and D3QN;
FIG. 21 is a model of a warehouse A with 6 sensors and 3 vents deployed;
FIG. 22 is a model of a warehouse B with 4 sensors and 3 vents deployed;
FIG. 23 is a graph comparing the average humidity accuracy, average temperature accuracy, average humidity uniformity and average temperature uniformity of a warehouse model A and a warehouse model B under the control of a 3D3QN-PER algorithm;
FIG. 24 is a graph comparing the average humidity accuracy, average temperature accuracy, average humidity uniformity and average temperature uniformity of the warehouse model A and the warehouse model C under the control of the 3D3QN-PER algorithm;
FIG. 25 is a graph comparing average adjustment time and average energy consumption of the warehouse model A, the warehouse model B and the warehouse model C under the control of the 3D3QN-PER algorithm.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments and the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted herein that the term "coupled," if not specifically stated, may refer herein to not only a direct connection, but also an indirect connection in which an intermediate is present.
The reinforcement learning method can learn the optimal strategy through interaction with the environment, and the model-free approach avoids the complexity and difficulty of modeling and computing different environments. Reinforcement learning methods have therefore been introduced in many research works to control indoor temperature and humidity. These works use different reinforcement learning algorithms to adjust indoor settings, such as windows, air conditioner switches, swing direction and outlet temperature, so as to improve human comfort. However, they do not accurately consider the accuracy and uniformity of the temperature at different indoor positions; they focus on human comfort, whose acceptable range is wide, and are therefore unsuitable for cultural relic preservation, which is sensitive to temperature and humidity. Moreover, these works control a single air outlet, which makes a uniform effect difficult to achieve, and they do not consider sudden external disturbances. A method is therefore needed that meets the requirement of accurate and uniform temperature and humidity in a cultural relic preservation environment and has stronger anti-interference capability.
The invention provides a method for regulating and controlling the temperature and humidity of an indoor space, which runs on a controller. The controller is connected with a plurality of sensors and a plurality of actuators in a set space through the Internet of Things; the sensors comprise humidity sensors and temperature sensors, and the actuators are air outlets of a constant temperature and humidity machine. The humidity sensors and the temperature sensors can be arranged together or at different positions according to specific requirements. The constant temperature and humidity machine can run at different wind speed gears under the set temperature and set humidity; after the temperature and humidity are set, the controller controls the constant temperature and humidity machine to change the wind speed gears during operation. The method comprises the following steps S101 to S103:
it should be emphasized that, the steps S101 to S103 described in this embodiment are not limited to the order of the steps, and it should be understood that the order of the steps may be changed or parallel.
Step S101: acquiring the humidity values and temperature values collected by the humidity sensors and the temperature sensors at a specified interval as state parameters to form a state space, and taking the wind speed gears at which the air outlets operate under the set humidity and set temperature as action parameters to form an action space.
Step S102: outputting corresponding action parameters according to the state parameters of each time step by adopting the preset deep reinforcement learning model and controlling the actuators to execute them; acquiring the actual humidity value of each humidity sensor and the actual temperature value of each temperature sensor at each time step, calculating the humidity accuracy deviation between the actual humidity value of each humidity sensor and the set humidity and the humidity uniformity deviation among the actual humidity values, calculating the temperature accuracy deviation between the actual temperature value of each temperature sensor and the set temperature and the temperature uniformity deviation among the actual temperature values, and calculating an observation reward value according to the humidity accuracy deviation, the humidity uniformity deviation, the temperature accuracy deviation and the temperature uniformity deviation corresponding to each time step; and storing the state parameters, the action parameters and the observation reward values corresponding to each time step in the experience pool as experience data.
Step S103: in the deep reinforcement learning process, the preset deep reinforcement learning model samples the experience data in the experience pool according to priority and uses a neural network to fit the Q value of the selected action parameter. The neural network model is provided with a local network for calculating the predicted Q value of the current time step and selecting action parameters according to a set strategy, and with a target network for calculating the target Q value of the next time step; the local network and the target network have the same structure, and the parameters of the local network are updated to the target network at set time intervals. The neural network decomposes the Q value of an action parameter into a state value part related only to the state and an advantage function part related to both the state and the action. A loss function is constructed based on the predicted Q value, the target Q value and the observed reward values of a plurality of future time steps, and learning proceeds until convergence with maximization of the sum of the reward values corresponding to the action parameters of each time step as the optimization objective.
In step S101, an agent for deep reinforcement learning runs on the controller and obtains the humidity values and temperature values collected by the humidity sensors and the temperature sensors at a set time interval. Specifically, the humidity sensors and temperature sensors may be placed at the same detection points or separately. The detection points are distributed evenly in the set space and can also be arranged at specific positions according to specific detection requirements. The temperature values and humidity values of the temperature sensors and humidity sensors obtained by the agent in the controller are used as state parameters to form the state space. The controller forms the action space from the wind speed gears selectable for the air outlets of the constant temperature and humidity machine, and operates with the set temperature and the set humidity as the target.
In step S102, the preset deep reinforcement learning model selects and outputs the corresponding action parameters according to the state parameters of each time step. The observed values of all parameters in the real environment, including the actual humidity values of the humidity sensors and the actual temperature values of the temperature sensors, are acquired to reflect the real state of the environment. The purpose of this embodiment is to control the humidity and the temperature of the set space accurately and uniformly, so the observation reward value evaluates the value of an action from the two aspects of control accuracy and uniformity. Further, an experience pool is constructed for storing the experience data.
In some embodiments, in step S102, in calculating the humidity accuracy deviation between the actual humidity value of each humidity sensor and the set humidity and the humidity uniformity deviation among the actual humidity values, the humidity accuracy deviation is calculated as:

H_{sc} = -\frac{1}{k_1} \sum_{i=1}^{k_1} \left| H_t^i - H_{set} \right|

where H_sc denotes the humidity accuracy deviation, H_t^i denotes the humidity value detected by the i-th humidity sensor at time step t, H_set denotes the set humidity, and k_1 is the number of humidity sensors.

The humidity uniformity deviation is calculated as:

H_{unif} = -\frac{1}{k_1} \sum_{i=1}^{k_1} \left| H_t^i - \bar{H}_t \right|, \qquad \bar{H}_t = \frac{1}{k_1} \sum_{i=1}^{k_1} H_t^i

where H_unif is the humidity uniformity deviation, \bar{H}_t is the mean of the humidity values detected by the humidity sensors at time step t, and k_1 is the number of humidity sensors.

In calculating the temperature accuracy deviation between the actual temperature value of each temperature sensor and the set temperature and the temperature uniformity deviation among the actual temperature values, the temperature accuracy deviation is calculated as:

T_{sc} = -\frac{1}{k_2} \sum_{i=1}^{k_2} \left| T_t^i - T_{set} \right|

where T_sc denotes the temperature accuracy deviation, T_t^i denotes the temperature value detected by the i-th temperature sensor at time step t, T_set denotes the set temperature, and k_2 is the number of temperature sensors.

The temperature uniformity deviation is calculated as:

T_{unif} = -\frac{1}{k_2} \sum_{i=1}^{k_2} \left| T_t^i - \bar{T}_t \right|, \qquad \bar{T}_t = \frac{1}{k_2} \sum_{i=1}^{k_2} T_t^i

where T_unif denotes the temperature uniformity deviation, \bar{T}_t is the mean of the temperature values detected by the temperature sensors at time step t, and k_2 is the number of temperature sensors.
In some embodiments, calculating the observed reward value according to the humidity accuracy deviation, the humidity uniformity deviation, the temperature accuracy deviation and the temperature uniformity deviation corresponding to each time step comprises:
R_t = \alpha_1 (T_{sc} + H_{sc}) + \alpha_2 (T_{unif} + H_{unif});  (7)

where R_t is the observation reward value, and \alpha_1 and \alpha_2 are weight coefficients.
In step S103, the invention synchronously regulates the temperature and humidity in the set space, and the state space is continuous, so the Q-learning algorithm based on a discrete Q-value table is not applicable here. Therefore, in this embodiment, a neural network is used to fit the value of the selected action parameter, i.e., the Q value, so that the Q value approaches the optimal Q value. The neural network may be a convolutional neural network or a fully connected network.
The preset deep reinforcement learning model samples the experience data in the experience pool according to priority for learning. In order to account for the importance of the experience data and improve the sampling learning rate, this embodiment selects and replays the experience data in the experience pool based on the TD-error: the larger the TD-error, the higher the priority. The TD-error of each time step is the error of the state-action value estimate at that time step, and the sampling probability of the experience data of that time step is proportional to this error.
In some embodiments, in step S103, the sampling of the experience data in the experience pool by the preset deep reinforcement learning model according to priority comprises: obtaining the error TD-error of the state parameters of each time step, the priority p_ξ of each piece of experience data being proportional to the TD-error:

p_\xi \propto \left| \delta_\xi \right|;  (8)

\delta_\xi = R_{t+1} + \gamma \, Q_{\theta^-}\!\left(S_{t+1}, \arg\max_{a'} Q_\theta(S_{t+1}, a')\right) - Q_\theta(S_t, A_t);  (9)

where δ_ξ is the value of the TD-error, p_ξ is the priority, R_{t+1} is the observation reward value at time step t+1, γ is the discount factor, S_{t+1} is the state parameter at time step t+1, S_t is the state parameter at time step t, Q_θ(S_t, A_t) is the predicted Q value output by the local network for state S_t and the executed action A_t, a' is the action corresponding to the maximum predicted Q value selected by the local network, and Q_{θ^-}(S_{t+1}, a') is the target Q value output by the target network for state S_{t+1} under the selected action a'.
Further, in order to avoid over-estimation of the reward value, the neural network in the present application constructs a local network and a target network at the same time to decouple action selection from Q value calculation. The local network and the target network have the same structure; both take the state parameter as input and output Q values. For the current state S_t, the local network selects the action with the largest Q value based on a fully greedy algorithm; assuming this action is a_1, its Q value is Q(S_t, a_1). The action is applied to the environment to obtain the state S_{t+1} of the next time step, the state parameter S_{t+1} is input into the target network, and the Q value Q(S_{t+1}, a_1) corresponding to a_1 is found. Finally, the prediction of the local network is used as the predicted value and R_{t+1} + γQ(S_{t+1}, a_1) is used as the actual value, and error back-propagation is performed. The loss function can be chosen as, for example, the squared error, as in supervised learning. After each set period of time, the parameters of the local network are hard-copied into the target network.
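As a sketch of this local/target arrangement (using TensorFlow 2, which the embodiment later names as its framework): two networks with identical structure are built, and the local parameters are hard-copied into the target network at set intervals. The layer sizes follow the embodiment's two 512-unit hidden layers; the state and action dimensions shown are illustrative.

```python
import tensorflow as tf

def build_q_network(state_dim, n_actions):
    """Fully connected Q network: state in, one Q value per discrete action out."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(n_actions),
    ])

state_dim, n_actions = 18, 64        # e.g. 9 T + 9 H sensors, 4^3 vent-gear combinations
local_net = build_q_network(state_dim, n_actions)
target_net = build_q_network(state_dim, n_actions)

def hard_update(local, target):
    """Copy the local network parameters into the target network."""
    target.set_weights(local.get_weights())

hard_update(local_net, target_net)   # called again at every set interval
```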
The neural network decomposes the estimated reward value of an action parameter into a state value part related only to the state and an advantage function part related to both the state and the action; noise can also be introduced into the sampling. The Q network is considered to be divided into two parts: the first part is related only to the state s and not to the specific action a taken, is called the value function part and is denoted V(s); the second part is related to both the state s and the action a, is called the advantage function (Advantage Function) part and is denoted A(s, a). The final value function can then be expressed again as:
Q(s,a)=V(s)+A(s,a); (10)
in some embodiments, constructing a loss function based on the predicted Q value, the target Q value, and the observed reward values for a plurality of time steps in the future in step S103 includes:
defining the sum of returns of n future time steps, R_t^{(n)}, as:

R_t^{(n)} = \sum_{x=0}^{n-1} \gamma^{(x)} R_{t+x+1}

where γ^{(x)} is the discount factor of the x-th future time step and R_{t+x+1} is the observation reward value of the x-th future time step;

setting the loss function L as:

L = \left( R_t^{(n)} + \gamma^{(n)} Q_{\theta^-}\!\left(S_{t+n}, \arg\max_{a'} Q_\theta(S_{t+n}, a')\right) - Q_\theta(S_t, A_t) \right)^2

where R_t^{(n)} is the sum of the observation reward values of the next n steps, γ^{(n)} is the n-step discount factor, S_{t+n} is the state parameter at time step t+n, S_t is the state parameter at time step t, Q_θ(S_t, A_t) is the predicted Q value output by the local network for state S_t and the executed action A_t, a' is the action corresponding to the maximum predicted Q value selected by the local network, and Q_{θ^-}(S_{t+n}, a') is the target Q value output by the target network for state S_{t+n} under the selected action a'.
In some embodiments, the discount factor is 0.9-0.95.
In some embodiments, the policy is set to an ε -greedy policy, with actions selected randomly with a probability of ε and actions selected by the neural network with a probability of 1- ε. And the neural network selects the action with the highest predicted reward value in the process of selecting the action.
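A minimal sketch of the ε-greedy selection and a linear decay schedule; the decay endpoints follow the values given later in the embodiment (ε from 1 to 0.001 over 200 rounds), while the function names are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an action index: random with probability epsilon, greedy otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploitation

def decayed_epsilon(episode, start=1.0, end=0.001, decay_episodes=200):
    """Linearly anneal epsilon from start to end over decay_episodes rounds."""
    frac = min(episode / decay_episodes, 1.0)
    return start + frac * (end - start)

print(epsilon_greedy([0.1, 0.7, 0.3], decayed_epsilon(episode=500)))   # mostly greedy: index 1
```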
In some embodiments, the learning rate of the predetermined deep reinforcement learning model is 0.00005 to 0.0001.
In another aspect, the present invention also provides an indoor space temperature and humidity regulating system, comprising:
a plurality of sensors including a humidity sensor and a temperature sensor, the sensors being disposed in the set space;
the constant temperature and humidity machine is provided with a plurality of air outlets, each air outlet is provided with a plurality of wind speed gears, the wind speed gears of the air outlets are independently arranged, and the air outlets are arranged in a set space;
and the controller is connected with each sensor and used for acquiring a humidity value and a temperature value, and the controller is also connected with the constant temperature and humidity machine and used for controlling the temperature and the humidity in the set space according to the indoor space temperature and humidity regulation and control method in the steps S101 to S103.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
The invention is illustrated below with reference to a specific example:
this embodiment provides a reinforcement learning method for accurate and uniform regulation of the temperature and humidity of a museum storeroom, which adjusts the temperature and humidity by controlling the gears of the indoor constant temperature and humidity machine through a reinforcement learning algorithm, and comprises the following steps 201 to 205:
step 201: designing a simulation scene, and designing elements of a control system, wherein three major elements of a reinforcement learning algorithm are required to be designed according to the current scene due to the use of the reinforcement learning-based control algorithm: status, control actions, and rewards.
Step 202: simulate the actual storeroom scene in CFD simulation software, initialize the simulation scene, and transmit the current environment state (temperature and humidity conditions) to the reinforcement learning control agent.
Step 203: the agent calculates the current reward value according to the incoming state, selects the optimal action for the current state according to a certain rule through neural network training, and sends the action to the actuator, namely the constant temperature and humidity machine.
Step 204: after the constant temperature and humidity machine finishes executing the action, the next state is sent to the agent for the next round of learning. This cycle forms the interaction process of the agent with the environment.
Step 205: through training and learning, the agent can select the optimal action according to the current state, so that the indoor temperature and humidity reach an accurate and uniform state; when an external disturbance occurs, the required response time is shortened and the energy consumption is reduced, keeping the temperature and humidity at the optimum.
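The loop of steps 201 to 205 can be sketched as follows. CFDSimulation and DRLAgent are illustrative stand-ins for the CFD environment and the controller agent (they are not interfaces defined by the patent); the stubs below only show the shape of the interaction.

```python
import random

class CFDSimulation:
    """Stand-in for the CFD environment: returns fake sensor states and rewards."""
    def reset(self):
        return [random.uniform(20, 30) for _ in range(9)] + [random.uniform(40, 60) for _ in range(9)]
    def step(self, action):
        next_state = self.reset()          # placeholder dynamics
        reward = -random.random()          # placeholder reward
        return next_state, reward

class DRLAgent:
    """Stand-in for the DRL controller (steps 203 and 205)."""
    def select_action(self, state):
        return [random.choice(["off", "low", "medium", "high"]) for _ in range(3)]
    def store(self, *experience):
        pass                               # would push the transition into the experience pool
    def learn(self):
        pass                               # would sample by priority and update the networks

def run_round(env, agent, steps=30):
    state = env.reset()                            # step 202: initialize the simulated storeroom
    for _ in range(steps):
        action = agent.select_action(state)        # step 203: choose vent gears for this state
        next_state, reward = env.step(action)      # step 204: CTHA executes, environment advances
        agent.store(state, action, reward, next_state)
        agent.learn()                              # step 205: train towards accurate, uniform control
        state = next_state

run_round(CFDSimulation(), DRLAgent())
```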
Specifically, a deep reinforcement learning control system is designed. In the environment part, a constant temperature and humidity system is configured in a museum storeroom, and it regulates the temperature and humidity of the indoor environment through a plurality of air inlets. Because cultural relics made of different materials need to be stored in specific temperature and humidity environments, the temperature and humidity of the air in the storeroom must be strictly controlled. The sensor part adopts temperature and humidity sensors, which detect the environmental conditions at regular intervals and upload the collected data to the controller through an Internet of Things (IoT) network. Large rooms often have the problem of non-uniform indoor temperature and humidity; thus, distributed temperature and humidity sensors are deployed to detect data at different locations in the indoor environment. In the controller part, the controller adopts an algorithm based on DRL (Deep Reinforcement Learning); it aims to keep the temperature and humidity within the desired range and uniform, and makes control decisions by updating the wind speed gears of a plurality of air outlets according to the environmental information uploaded by the sensors. In the actuator part, the constant temperature and humidity air conditioning system adjusts the wind speed gears of the vents according to the decisions of the controller; in the designed system, the wind speeds of different vents can be adjusted to different gears.
As shown in the indoor temperature and humidity control flow chart of fig. 2, during control the distributed temperature and humidity sensors upload the collected temperature and humidity values of each indoor point to the control system; the DRL-based controller takes the information uploaded by the distributed sensors as the state-parameter input, trains through the neural network to obtain the corresponding decision, and outputs the action decision to the actuator. In the controller, the DRL neural network structure of the agent is composed of fully connected layers.
A system model of the controller is established and the problem is defined. The control process is defined as a Markov decision process, because the indoor air temperature and humidity of the next time slot are determined by the current indoor state and the action of the CTHA system and are independent of earlier states. Control optimization can therefore be formulated as a reinforcement learning problem.
1) Determining the indoor temperature and humidity states: the DRL-based controller of this embodiment makes decisions based on the current indoor multi-point temperature and humidity conditions, so the state is an important factor. Distributed sensors are deployed to detect indoor environmental information, including the temperature and humidity at various points. The temperature value and the humidity value detected by the i-th distributed sensor at time step t are denoted T_t^i and H_t^i.
2) Temperature and humidity set points: the target temperature and target humidity are defined as T_set and H_set, which can be determined according to the optimum temperature and humidity for cultural relic preservation. Since it is almost impossible to maintain the target temperature and humidity exactly, the expected deviations of temperature and humidity are ±0.5 ℃ and ±1%. The goal of the control algorithm is to reduce the deviation from the desired state as much as possible.
3) Wind speed setting: the constant temperature and humidity system works at the set temperature and humidity, and the DRL-based agent maintains the indoor temperature and humidity in a uniform and accurate state mainly by controlling the wind speed gears of the CTHA. In this embodiment, the wind speed gear set is defined as F = {off, low, medium, high}, i.e., four gear positions: off, low, medium and high.
4) Energy consumption of the CTHA system: the energy consumed by the system is proportional to the amount of ventilation and is used for heating, humidification, etc. The energy consumption is measured by the smart meters of the system and is evaluated in kW·h, considering only the overall energy consumption in each time slot.
5) Problem definition. System state (State): the control decision (i.e. the gear values of the plurality of air outlets) is based on the observation of the current indoor temperature and humidity. The system state of each time slot consists of the current temperature and humidity detected by the plurality of sensors. This embodiment defines the system state as:

S_t = \{ T_t^1, \ldots, T_t^\beta, H_t^1, \ldots, H_t^\beta \}

where β represents the number of temperature and humidity sensors.
Control action (Action): in this embodiment the controllable variables are the wind speed gears of the plurality of air outlets of the CTHA system, and the wind speed of each outlet can be selected from the four gears defined by F. The control action is defined as:

A_t = \{ f_t^1, f_t^2, \ldots, f_t^m \}, \qquad f_t^m \in F

where m represents the number of air outlets and f_t^m represents the wind speed gear of the m-th air outlet at time step t. The entire action space is therefore the Cartesian product of the gear sets of the m outlets, i.e. F^m, which contains 4^m possible actions.
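Because the Q network outputs one value per discrete action, the combined gear choices of the m outlets are usually flattened into a single index. The sketch below shows one such encoding for m = 3 outlets; it is an illustration of the action space F^m, not code from the patent.

```python
from itertools import product

GEARS = ["off", "low", "medium", "high"]      # wind speed gear set F
NUM_VENTS = 3                                 # m air outlets in the example storeroom

# Enumerate the action space F^m: every combination of per-vent gears (4^3 = 64 actions).
ACTIONS = list(product(range(len(GEARS)), repeat=NUM_VENTS))

def encode_state(temps, hums):
    """Concatenate the multi-sensor readings into the state vector S_t."""
    return list(temps) + list(hums)

def decode_action(index):
    """Map a Q-network output index back to per-vent gear names."""
    return [GEARS[g] for g in ACTIONS[index]]

print(len(ACTIONS))        # 64
print(decode_action(5))    # ['off', 'low', 'low']
```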
Reward (Reward): when the agent performs an action in the current state, the environment will enter a new state and receive a reward, which is referred to equation 7 and consists of two parts since it mainly takes into account the accuracy and uniformity of indoor temperature and humidity.
R_t = \alpha_1 (T_{sc} + H_{sc}) + \alpha_2 (T_{unif} + H_{unif});  (7)

where R_t is the reward value, and \alpha_1 and \alpha_2 are the weight coefficients.
T_{sc} = -\frac{1}{k_2} \sum_{i=1}^{k_2} \left| T_t^i - T_{set} \right|

H_{sc} = -\frac{1}{k_1} \sum_{i=1}^{k_1} \left| H_t^i - H_{set} \right|

H_{unif} = -\frac{1}{k_1} \sum_{i=1}^{k_1} \left| H_t^i - \bar{H}_t \right|;  (17)

T_{unif} = -\frac{1}{k_2} \sum_{i=1}^{k_2} \left| T_t^i - \bar{T}_t \right|;  (18)
The first term in equation 7 measures the deviation of the temperature and humidity at each point from the target state. Two variables are defined, the temperature accuracy T_sc (i.e., the temperature accuracy deviation) and the humidity accuracy H_sc (i.e., the humidity accuracy deviation), to measure the accuracy of the indoor temperature and humidity, where T_t^i and H_t^i respectively represent the temperature and humidity detected by each sensor. The control process requires the temperature and humidity at each point to be as close as possible to the desired conditions.
The second term in equation 7 focuses on the uniformity of the indoor temperature and humidity; the controller agent aims to reduce the uneven distribution of temperature and humidity in the room. Thus, this embodiment defines the measured value H_unif as the humidity uniformity deviation (i.e., average humidity uniformity) and T_unif as the temperature uniformity deviation (i.e., average temperature uniformity), as shown in equations 17 and 18, where \bar{T}_t is the average of the temperatures detected by the temperature sensors and \bar{H}_t is the average of the humidity values detected by the humidity sensors. In equation 7, α_i (i = 1, 2) denotes the weights and indicates the relative importance of the two parts. If the accuracy of the indoor temperature and humidity is more important, the parameter α_1 should be set to a larger value; otherwise, it should be set to a smaller value to obtain higher uniformity.
Optimization objective: the agent judges whether an action is good according to the result produced by the environment. The goal of the method is to learn an action sequence that maximizes the sum of the reward values of all actions over the whole time period; the objective function is expressed as:

\max \; \sum_{t} \gamma^{t} R_{t}
where γ represents a discount factor and γ < 1. The goal of deep reinforcement learning is to maximize the total of discount rewards.
Algorithm structure design:
since the temperature and humidity values are continuous, the state space is large, and normal Q learning causes space explosion when storing the state action space. The present embodiment therefore decides to employ DRL (Deep Learning) which incorporates Deep Learning, allowing the agent to handle the problem of complex and large dimensional state input. Since the action is a discrete variable, the present embodiment constructs a fitting Q value based on a fully connected neural network.
Based on the Double DQN structure, the neural network model sets a local network for calculating the predicted Q value of the current time step and selecting action parameters according to the set strategy, and sets a target network for calculating the target Q value of the next time step; the local network and the target network have the same structure, and the parameters of the local network are updated to the target network at set time intervals. Based on the current environmental parameters, including the temperature and humidity detected by the sensors, the local network selects the wind speed gears of the air outlets of the CTHA system using an ε-greedy strategy. At the end of each time slot, the reward value is calculated according to equation 7; the current reward value, the next state s_{t+1}, the current action a_t and the current state s_t are combined into an experience and stored in the experience pool. During training, mini-batches are randomly sampled from the experience pool and input into the local network Q_θ and the target network Q_{θ^-} to compute the loss function; the loss function is used to update the weights of the local network Q_θ, and these weights are copied to the target network Q_{θ^-} at a fixed frequency.
The n-step method is introduced, and the rewards of multiple future steps are observed for the update. The sum of returns of n future time steps, R_t^{(n)}, is defined as in equation 11:

R_t^{(n)} = \sum_{x=0}^{n-1} \gamma^{(x)} R_{t+x+1}  (11)

where γ^{(x)} is the discount factor of the x-th future time step and R_{t+x+1} is the observation reward value of the x-th future time step.
the loss function L is set as follows:
Figure GDA0003747166710000153
wherein the content of the first and second substances,
Figure GDA0003747166710000154
sum of reward values, gamma, for future n-step observations (n) For a discount factor of n steps, S t+1 State parameter at time step t +1, S t Is a state parameter of t time step, Q θ (S t ,A t ) Is in state S for the local network t Action A with the highest output value t A' is an action corresponding to the maximum predicted Q value selected based on the local network,
Figure GDA0003747166710000155
state S output for the target network under the condition of selecting action a t+1 The target Q value of (1).
Further, based on the Dueling DQN structure, as shown in fig. 3, the local network and the target network evaluate the predicted Q value and the target Q value in two parts: the first part is related only to the state s and not to the action a to be taken, is called the value function part and is denoted V(s); the second part is related to both the state s and the action a, is called the advantage function (Advantage Function) part and is denoted A(s, a). The final value function can then be expressed again as:
Q(s,a)=V(s)+A(s,a); (10)
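A sketch of such a dueling head in TensorFlow 2 follows. The mean-advantage subtraction in the last line is the usual trick for making V and A identifiable; it is an assumption of this sketch, since the text above only states Q = V + A. Layer sizes match the two 512-unit hidden layers used later in the embodiment.

```python
import tensorflow as tf

class DuelingQNetwork(tf.keras.Model):
    """Q(s, a) = V(s) + A(s, a), with the advantage centred over the actions."""
    def __init__(self, n_actions):
        super().__init__()
        self.h1 = tf.keras.layers.Dense(512, activation="relu")
        self.h2 = tf.keras.layers.Dense(512, activation="relu")
        self.v = tf.keras.layers.Dense(1)            # state-value stream V(s)
        self.a = tf.keras.layers.Dense(n_actions)    # advantage stream A(s, a)

    def call(self, state):
        h = self.h2(self.h1(state))
        v, a = self.v(h), self.a(h)
        return v + a - tf.reduce_mean(a, axis=1, keepdims=True)

net = DuelingQNetwork(n_actions=64)
print(net(tf.zeros((1, 18))).shape)   # (1, 64): one Q value per vent-gear combination
```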
in this embodiment, based on Prioritized Experience Replay, the deep reinforcement learning process stores Experience data by constructing an Experience pool, and priority Experience Replay gives a certain priority to different experiences, so that some more "important" experiences can be sampled at a higher frequency. Priority p of each experience ξ Will be proportional to the value delta of TD-error ξ Defined as in formulas 8 and 9:
p ξ ∝|δ ξ |; (8)
Figure GDA0003747166710000161
wherein, delta ξ Is the value of said TD-error, p ξ To the priority, R t+1 Is the observed reward value of t +1 time step, gamma is the discount factor, S t+1 State parameter at time step t +1, S t Is a state parameter of t time step, Q θ (S t ,A t ) Is in state S for the local network t Action A with the highest output value t A' is an action corresponding to the maximum predicted Q value selected based on the local network,
Figure GDA0003747166710000162
state S output for the target network under the condition of selecting action a t+1 The target Q value of (1).
The sampling probability distribution P(ξ) and the importance weight w_ξ are calculated as follows:

P(\xi) = \frac{p_\xi^{\lambda}}{\sum_{\kappa \in K} p_\kappa^{\lambda}}

w_\xi = \left( \frac{1}{N \cdot P(\xi)} \right)^{\sigma}

where N represents the total number of experiences, λ determines how strongly the priority is applied, p_ξ denotes the priority of experience ξ, P(ξ) denotes the sampling probability of the experience, w_ξ denotes the importance weight, σ determines the degree of importance-weight correction, and K denotes all experiences in the experience pool.
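A minimal NumPy sketch of prioritized sampling with these two formulas; the λ and σ values and the final weight normalization are illustrative choices, not values given by the patent.

```python
import numpy as np

def per_sample(priorities, batch_size, lam=0.6, sigma=0.4, rng=None):
    """Sample experience indices with P(xi) proportional to p_xi**lam and return
    the matching importance weights w_xi = (1 / (N * P(xi)))**sigma."""
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=float) ** lam
    probs = p / p.sum()                                  # P(xi)
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    n = len(probs)
    weights = (1.0 / (n * probs[idx])) ** sigma          # w_xi
    weights = weights / weights.max()                    # normalize for stable updates (assumption)
    return idx, weights

idx, w = per_sample([0.5, 2.0, 0.1, 1.2], batch_size=2)
print(idx, w)
```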
In this embodiment, Double DQN, Dueling DQN, Prioritized Experience Replay, and n-step are combined, and the deep reinforcement learning method of the structure is defined as nD3QN-PER (n-step Double DQN with Prioritized Experience Replay). Next, for comparison, the deep reinforcement learning method combining Double DQN and Prioritized Experience Replay is also defined as DDQN-PER, and the deep reinforcement learning method combining Double DQN and Dueling DQN is defined as D3QN (Dueling Double DQN).
The nD3QN-PER, DDQN-PER and D3QN were used for comparison, respectively.
In this embodiment, the temperature and humidity control simulation is completed in CFD simulation software, and the algorithm part is implemented with the open-source deep learning framework TensorFlow 2.0.
An experiment was designed to simulate a museum storeroom of 10 m (length) × 9 m (width) × 3 m (height), as shown in figure 4. Two doors are located in the north and south directions respectively. In the middle of the room, two cabinets storing cultural relics are deployed. The constant temperature and humidity air conditioning system adjusts the indoor temperature and humidity through three air inlets at the top of the room. Nine temperature and humidity sensors are evenly arranged in the room.
In the experiment of this example, T_set = 25 ℃ and H_set = 50%. The nine sensors upload the detected data to the control system. The system state of the current time slot is therefore defined as S_t = {T_t^1, ..., T_t^9, H_t^1, ..., H_t^9}. The action is defined as A_t = {f_t^1, f_t^2, f_t^3}, f ∈ F, where f_t^1, f_t^2 and f_t^3 respectively represent the wind speed gears of the three air vents. It is assumed that the museum storeroom has a good heat-insulation design and there is no heat exchange with the outside. The initial temperature and humidity are non-uniform in each training round, and each round includes 30 steps (one step every 2 minutes). The number of training rounds is greater than 800. In order to improve the anti-interference capability of the control system, it is assumed that an obvious disturbance enters from the north door of the room, so that the temperature and humidity state of the room changes noticeably in each round. The agent is trained to recognize the disturbance and respond to it in a shorter time.
There are two hidden layers in the DRL neural network, each with 512 neurons, and ReLU is used as the activation function. The Adam optimizer is used with a learning rate of 0.0001. The discount factor is set to 0.9 and the mini-batch size is 32. In the action selection process, an ε-greedy strategy is used to balance exploitation and exploration; ε is initially 1 and decays to 0.001 over 200 rounds.
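A minimal TensorFlow 2 sketch of a dueling Q-network with these hyperparameters is given below. Here state_dim = 18 assumes nine temperature plus nine humidity readings, and num_actions = 27 is a placeholder assuming three discrete gears per vent (3³ combinations); neither value is stated as such in the embodiment.

```python
import tensorflow as tf

def build_dueling_q_network(state_dim=18, num_actions=27):
    """Dueling Q-network sketch: two 512-unit ReLU layers, Adam with lr 1e-4."""
    inputs = tf.keras.Input(shape=(state_dim,))
    x = tf.keras.layers.Dense(512, activation="relu")(inputs)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    v = tf.keras.layers.Dense(1)(x)             # state value V(s)
    a = tf.keras.layers.Dense(num_actions)(x)   # advantages A(s, a)
    # dueling aggregation: Q = V + (A - mean(A))
    q = tf.keras.layers.Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([v, a])
    model = tf.keras.Model(inputs, q)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mse")
    return model
```

The local and target networks would both be built this way, with the local network's weights copied to the target network at the set interval.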
The proposed control system is evaluated under different settings. Since there is no directly comparable prior work, its performance is compared with two commonly used rule-based methods; the rule-based-1 and rule-based-2 control strategies are given in equations 22 and 23, respectively.
[Equations 22 and 23: on-off control rules of rule-based-1 and rule-based-2, defined over the sensor readings T_i and H_i.]
In the real world, the sensors of a CTHA system are typically deployed on top of the side walls; in this experiment, 3 such sensors are assumed to be installed. The system employs an on-off control strategy. For example, in the rule-based-2 method, if the average indoor temperature or humidity is outside the expected boundaries (i.e., Δ_T = 0.5 °C, Δ_H = 1%), the wind speed gear of the three vents is set to the high mode; otherwise, the system is shut down. Here, T_i indicates the temperature value detected by the i-th sensor and H_i indicates the humidity value detected by the i-th sensor.
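Since the published forms of equations 22 and 23 are not reproduced in this excerpt, the following sketch only paraphrases the on-off behaviour described above for rule-based-2; the gear labels "high"/"off" and the threshold defaults are illustrative assumptions.

```python
def rule_based_2(temps, hums, t_set=25.0, h_set=50.0, delta_t=0.5, delta_h=1.0):
    """On-off rule in the spirit of rule-based-2: if the average reading of the
    monitored sensors leaves the tolerance band, all three vents are set to the
    high wind-speed gear; otherwise the system is switched off.
    """
    t_avg = sum(temps) / len(temps)
    h_avg = sum(hums) / len(hums)
    out_of_band = abs(t_avg - t_set) > delta_t or abs(h_avg - h_set) > delta_h
    return ("high", "high", "high") if out_of_band else ("off", "off", "off")
```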
The convergence of the three algorithms nD3QN-PER, DDQN-PER and D3QN is compared first; the results are shown in FIG. 5. It can be observed that nD3QN-PER (3-step returns were used in this experiment, i.e., 3D3QN-PER) converges to a higher reward value than D3QN and DDQN-PER. Compared with D3QN, 3D3QN-PER combines D3QN with the 3-step and PER strategies: it takes the rewards, states and actions of the next 3 steps into account, so it is more far-sighted and stable, and the PER method makes up for the disadvantage of uniform sampling, allowing the agent to learn the more valuable experiences effectively. Thus, 3D3QN-PER performs better than the other DQN-variant-based methods.
The learning of the three algorithms nD3QN-PER, DDQN-PER and D3QN is then evaluated at different learning rates. As shown in FIG. 6, when the learning rate is large (e.g., 0.005), convergence is significantly faster, while when it is very small (e.g., 0.00005), the algorithm converges slightly more slowly. However, too high a learning rate may cause the model to settle on a sub-optimal solution through an unstable learning process, and the reward at convergence increases as the learning rate decreases. When the learning rate is set to 0.0001 or 0.00005, the algorithms converge to the best performance and are more stable than the other curves.
The learning of the three algorithms nD3QN-PER, DDQN-PER and D3QN is also evaluated under different discount factors, and FIG. 7 shows the performance comparison under different values of γ. The discount factor essentially determines the importance of future rewards relative to the current reward. As shown in FIG. 7, when the discount factor γ is set to 0.7, the convergence value of the algorithm is low. As γ increases from 0.8 to 0.9, the reward increases slightly, since a lower γ makes the agent short-sighted and more focused on the current reward. When γ is set to 0.95, the curve fluctuates considerably, but the reward difference from γ = 0.9 is not significant.
Further, for nD3QN-PER, the effect of different numbers of neurons in the hidden layers on convergence is compared. As shown in FIG. 8, when the two hidden layers become larger, convergence becomes faster, because more neurons bring better learning ability. However, the convergence values under these four settings do not differ significantly. In particular, the algorithm with 512 neurons in each hidden layer converges more stably, so the configuration with 512 neurons in both hidden layers is better overall.
Further, the average accuracy and uniformity of temperature and humidity under interference are compared for the rule-based-1, rule-based-2 and 3D3QN-PER methods, with 50 rounds of tests performed for the evaluation. Unlike in the training process, each round consists of 60 steps (2 hours); that is, temperature and humidity changes are observed over 2 hours and the accuracy and uniformity of the three methods are evaluated. In each round, a disturbance with random temperature and humidity (e.g., temperature 27 °C, humidity 45%) is assumed to enter at a random time, causing the temperature and humidity at certain locations in the room to deviate from the target values (temperature 25 °C, humidity 50%). The average temperature and humidity uniformity and accuracy over multiple points are calculated according to formulas 15-18. FIG. 9 shows that, compared with the rule-based-1 method, the 3D3QN-PER method improves the accuracy of temperature and humidity by 26.7% and 23.5% on average, and the uniformity by 22.4% and 29.9% on average. FIG. 10 shows that, compared with the rule-based-2 method, the proposed method performs better and achieves higher uniformity and accuracy under interference: the accuracy and uniformity of temperature are improved by 2.1% and 19.3%, those of humidity by 5.4% and 21.8%, and 18% of the energy consumption is saved. More importantly, it can be observed from FIG. 9 and FIG. 10 that the 3D3QN-PER-based approach is more stable under different interference.
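Formulas 15-18 themselves are not reproduced in this excerpt, so the sketch below only illustrates one plausible way such per-round accuracy and uniformity statistics could be computed from the distributed sensor readings (mean absolute deviation from the setpoint and from the spatial mean); the exact published definitions may differ.

```python
import numpy as np

def accuracy_and_uniformity(readings, setpoint):
    """Assumed per-step metrics: accuracy as the mean absolute deviation from
    the setpoint, uniformity as the mean absolute spread around the spatial
    average of all distributed sensors (lower is better in both cases)."""
    r = np.asarray(readings, dtype=float)
    accuracy_dev = float(np.mean(np.abs(r - setpoint)))
    uniformity_dev = float(np.mean(np.abs(r - r.mean())))
    return accuracy_dev, uniformity_dev
```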
Further, the control duration and energy consumption under interference are compared for nD3QN-PER, DDQN-PER, D3QN, rule-based-1 and rule-based-2. To evaluate the anti-interference capability of the control system, this embodiment measures the control time and energy required by the system to reach the target indoor temperature and humidity state under the different control methods; the results are shown in FIG. 11 and FIG. 12. The proposed nD3QN-PER method is clearly superior to the other methods. As can be seen from FIG. 11, the average adjustment time is shortened by 19.7%, 19.8%, 23.8% and 24.2% compared with D3QN, DDQN-PER, rule-based-1 and rule-based-2, respectively. The performance of DDQN-PER and D3QN does not differ significantly, and their regulation time is slightly lower than that of the two rule-based methods. Because the sensors in the rule-based methods are typically installed in a concentrated manner away from the entrances, which are possible sources of interference, the monitoring area of the controller is narrow; the controller therefore takes longer to detect the temperature and humidity changes in the room, resulting in high-latency control. In contrast, the distributed sensors of the 3D3QN-PER method can capture changes before the disturbing air disperses and make corresponding adjustments and timely decisions.
FIG. 12 shows the average energy consumption when handling disturbances. The DQN-based methods consume less energy; in particular, the 3D3QN-PER method saves the most energy, while the rule-based-2 method consumes the most. This is because, on the one hand, the DQN-based methods require less regulation time, and on the other hand, the design of multiple air vents, multiple wind-speed gears and distributed sensor deployment makes the system more flexible in dealing with interference. Compared with the other four methods, 3D3QN-PER reduces the energy consumption when handling interference by 21.7%, 22.9%, 26.6% and 34.2%, respectively.
Further, the performance of nD3QN-PER, rule-based-1 and rule-based-2 under the influence of the weather environment is compared.
In museums, the temperature and humidity of the storeroom and the exhibition hall need to be accurately controlled. However, the temperature and humidity in the exhibition hall are generally affected by the weather, so the change of indoor temperature and humidity is observed while taking weather changes into account. The period from 13:00 to 19:00 on September 13, 2021 is taken as the first time period, and FIGS. 13-18 show the changes of indoor temperature and humidity when regulated by 3D3QN-PER, rule-based-1 and rule-based-2 during this period. Compared with the two rule-based methods, the 3D3QN-PER method maintains a more stable indoor temperature and humidity within the desired fluctuation range, and the temperature and humidity detected by each sensor are closer to the target state. Because the CTHA system using a rule-based method makes control decisions by monitoring only a few sensors, it stops working once the detected temperature and humidity are within the target threshold, even though the temperature and humidity at other locations in the room may still be out of range; therefore, at some moments the temperature and humidity deviate from the desired range. Meanwhile, the on-off strategy adopted by the rule-based methods causes more temperature and humidity fluctuations. The results show that the proposed nD3QN-PER method is also suitable for exhibition halls.
In addition, this embodiment also evaluates the performance of the system under different weather conditions. Outdoor weather data were collected in Beijing from May to September. One day in each month was randomly selected for testing, and the temperature and humidity changes throughout that day were observed. FIG. 19 shows that the average indoor temperature and humidity over a day can be stabilized in the expected state with little fluctuation. The results show that, when the weather effect is introduced, the indoor average temperature and humidity can still be maintained within the desired range despite the obvious differences in outdoor weather, demonstrating that the nD3QN-PER-based method is feasible.
The average temperature and humidity under the nD3QN-PER, DDQN-PER and D3QN control methods over the course of a day are further compared. From FIG. 20 it can be observed that the average temperature and humidity can be controlled within the desired range under all three methods, and the 3D3QN-PER method clearly achieves better performance than the others. The standard deviation (SD) of the average indoor temperature of the nD3QN-PER method proposed in this embodiment is 0.44, while that of D3QN is 0.51 and that of DDQN-PER is 0.54. The standard deviation of the average humidity of nD3QN-PER is 1.2, while those of D3QN and DDQN-PER are 2.06 and 2.12, respectively. It can be seen that the 3D3QN-PER method yields much smaller temperature and humidity fluctuations and is therefore more suitable for cultural relic protection.
Further, the influence of the number of sensors on the regulation performance of the system is evaluated and analyzed. In order to reduce the actual deployment cost, this embodiment designs a warehouse model A with 6 sensors and 3 air supply outlets and a warehouse model B with 4 sensors and 3 air supply outlets to evaluate the influence of sensor deployment density on system performance; as shown in FIG. 21 and FIG. 22, the sensors are uniformly deployed in the warehouse.
The 3D3QN-PER algorithm is used for training. Likewise, the temperature and humidity accuracy and uniformity are evaluated over 50 rounds under random disturbances. FIG. 23 shows a comparison of the 6-sensor warehouse model A and the 4-sensor warehouse model B. It can be seen that the accuracy and uniformity of warehouse model A and warehouse model B do not differ much, but the performance of the 6-sensor system is slightly better. Further, as shown in FIG. 24, the warehouse model C (not shown) with 9 deployed sensors achieves the highest accuracy and uniformity in both temperature and humidity. Compared with the 6-sensor system, the 9-sensor system significantly improves the accuracy of temperature and humidity by 14.5% and 17.2%, and the uniformity by 4.2% and 5.9%.
Furthermore, the anti-interference performance with different numbers of sensors is compared, evaluating both the time and the energy consumed by the system to regulate the room to the target state under interference. As shown in FIG. 25, as the number of distributed sensors increases, the system consumes less adjustment time and energy. Compared with the 4-sensor system, the regulation time and energy consumption of the 9-sensor system are reduced by 31.1% and 22.6%; compared with the 6-sensor system, they are reduced by 24.1% and 14.8%. This is because the more sensors are installed, the more sensitive the system is to environmental changes. However, deploying too many sensors greatly increases the input dimension and requires more training time for the network, so the sensor deployment density should be chosen appropriately.
Considering that the environment for storing cultural relics has strict requirements on temperature and humidity precision, and that a long-term uniform and stable environment is more favorable for preservation, this embodiment provides a reinforcement learning method for accurately and uniformly regulating the temperature and humidity of a museum storeroom, as a complete solution. A distributed architecture is adopted: the temperature and humidity at multiple positions in the storeroom are monitored to control the multiple air vents of the constant temperature and humidity machine, so that the indoor temperature and humidity state can be regulated intelligently. The regulation method based on the deep reinforcement learning algorithm enables the agent to select the optimal control action under different indoor temperature and humidity states. Multiple objectives are achieved: the temperature and humidity at each position reach an accurate and suitable state, the temperature and humidity at all points reach a uniform state, and the system shows stronger anti-interference capability when external disturbances occur.
In summary, in the method and the apparatus for regulating and controlling the temperature and humidity of the indoor space according to the present invention, the sensors arranged in a distributed structure detect the humidity value and the temperature value in the set space as the state space, and the action corresponding to each time step state is selected by adopting a deep reinforcement learning manner. In the reinforcement learning process, the humidity precision deviation, the humidity uniformity deviation, the temperature precision deviation and the temperature uniformity deviation are quoted to calculate an observation reward value so as to comprehensively consider the control precision of the temperature and the humidity and the uniformity of each position in a set space, and the reinforcement control method can finally achieve the effect of accurately and uniformly controlling the temperature and the humidity in the set space.
Furthermore, the method adopts a neural network to fit the estimated value of each candidate action, so as to handle the continuous state space encountered when regulating indoor temperature and humidity; by sampling and learning experience data from the experience pool according to priority, the model can pay more attention to experiences whose state-value estimates have larger errors; by setting a local network and a target network, action selection and value estimation are separated, avoiding over-estimation of values; by decomposing the estimated value of an action into a state value and an advantage function, convergence is faster; and by introducing Gaussian noise and adopting an ε-greedy strategy for action selection, the exploration capability is effectively improved.
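As a small illustration of the action-selection scheme summarized above, the sketch below combines an ε-greedy rule with an optional Gaussian perturbation of the Q values; where exactly the embodiment injects the Gaussian noise is not specified in this excerpt, so noise_std and its placement are assumptions.

```python
import numpy as np

def select_action(q_values, epsilon, noise_std=0.0):
    """Epsilon-greedy selection with optional Gaussian perturbation of the
    Q values (the noise placement is an assumption of this sketch)."""
    q = np.asarray(q_values, dtype=float)
    if np.random.rand() < epsilon:                           # explore
        return int(np.random.randint(len(q)))
    q = q + np.random.normal(0.0, noise_std, size=q.shape)   # exploit, perturbed
    return int(np.argmax(q))
```

In training, ε would start at 1 and decay toward 0.001 over the first 200 rounds, matching the schedule reported above.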
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. The method for regulating and controlling the temperature and the humidity of the indoor space is characterized in that the method is used for operating on a controller, the controller is connected with a plurality of sensors and a plurality of actuators in a set space through the Internet of things, the sensors comprise humidity sensors and temperature sensors, the actuators are air outlets of a constant temperature and humidity machine, and the method comprises the following steps:
acquiring humidity values and temperature value groups acquired by each humidity sensor and each temperature sensor according to appointed interval time as state parameters to form a state space, and taking wind speed gears of each air outlet which operates at set humidity and set temperature as action parameters to form an action space;
outputting corresponding action parameters according to the state parameters of each time step by adopting a preset deep reinforcement learning model and controlling the actuator to execute; acquiring an actual humidity value of each humidity sensor and an actual temperature value of each temperature sensor at each time step, calculating a humidity precision deviation between the actual humidity value of each humidity sensor and the set humidity and a humidity uniformity deviation between the actual humidity values, calculating a temperature precision deviation between the actual temperature value of each temperature sensor and the set temperature and a temperature uniformity deviation between the actual temperature values, and calculating an observation reward value according to the humidity precision deviation, the humidity uniformity deviation, the temperature precision deviation and the temperature uniformity deviation corresponding to each time step; storing the state parameters, the action parameters and the observation reward values corresponding to each time step in an experience pool as experience data;
in the deep reinforcement learning process, the preset deep reinforcement learning model samples experience data in an experience pool according to priority, a neural network is adopted to fit the value Q of each action in the current state, the neural network model is provided with a local network for calculating the predicted Q value of each action at the current time step and selecting action parameters according to a set strategy, a target network is provided for calculating the target Q value of each action at the next time step, the local network and the target network have the same structure, and the parameters of the local network are updated to the target network at set time intervals; and the neural network decomposes the Q value of the action parameter into a state value part only related to the state and a dominance function part related to both the state and the action; constructing a loss function based on the predicted Q value, the target Q value and observation reward values of a plurality of time steps in the future, and learning until convergence by taking the sum of the maximized reward values as an optimization target;
wherein, calculating an observation reward value according to the humidity precision deviation, the humidity uniformity deviation, the temperature precision deviation and the temperature uniformity deviation corresponding to each time step comprises:
R_t = α_1(T_sc + H_sc) + α_2(T_unif + H_unif);
wherein R_t is the observed reward value, α_1 and α_2 are weight coefficients, H_sc represents the humidity precision deviation, H_unif represents the humidity uniformity deviation, T_sc represents the temperature precision deviation, and T_unif represents the temperature uniformity deviation;
the preset deep reinforcement learning model samples the experience data in the experience pool according to the priority, and the sampling comprises the following steps:
obtaining the TD-error of the state parameter at each time step, wherein the priority p_ξ of each piece of experience data is proportional to the TD-error, expressed as:
p_ξ ∝ |δ_ξ|;
δ_ξ = R_{t+1} + γ·Q_θ′(S_{t+1}, a′) − Q_θ(S_t, A_t);
wherein δ_ξ is the value of the TD-error, p_ξ is the priority, R_{t+1} is the observed reward value at time step t+1, γ is the discount factor, S_{t+1} is the state parameter at time step t+1, S_t is the state parameter at time step t, Q_θ(S_t, A_t) is the Q value output by the local network for state S_t and the selected action A_t, a′ is the action corresponding to the maximum predicted Q value selected based on the local network, and Q_θ′(S_{t+1}, a′) is the target Q value output by the target network for state S_{t+1} under the selected action a′;
constructing a loss function based on the predicted Q value, the target Q value, and observed reward values for a plurality of time steps in the future, comprising:
defining the sum of returns for n time steps in the future as:
G_t^(n) = Σ_{x=0}^{n−1} γ^(x)·R_{t+x+1};
wherein γ^(x) is the discount factor for the x-th future time step and R_{t+x+1} is the observed reward value of the x-th future time step;
setting the loss function L as:
L = (G_t^(n) + γ^(n)·Q_θ′(S_{t+n}, a′) − Q_θ(S_t, A_t))²;
wherein G_t^(n) is the sum of the observed reward values of the future n steps, γ^(n) is the discount factor for n steps, S_{t+n} is the state parameter at time step t+n, S_t is the state parameter at time step t, Q_θ(S_t, A_t) is the Q value output by the local network for state S_t and the selected action A_t, a′ is the action corresponding to the maximum predicted Q value selected based on the local network, and Q_θ′(S_{t+n}, a′) is the target Q value output by the target network for state S_{t+n} under the selected action a′.
2. The method for controlling the temperature and humidity of the indoor space according to claim 1, wherein the humidity accuracy deviation between the actual humidity value of each humidity sensor and the set humidity and the humidity uniformity deviation between the actual humidity values are calculated by:
H_sc = −(1/k_1)·Σ_{i=1}^{k_1} |H_t^i − H_set|;
wherein H_sc represents the humidity precision deviation, H_t^i represents the humidity value detected by the i-th humidity sensor at time step t, H_set represents the set humidity, and k_1 is the number of the humidity sensors;
the calculation formula of the humidity uniformity deviation is as follows:
H_unif = −(1/k_1)·Σ_{i=1}^{k_1} |H_t^i − H̄_t|;
H̄_t = (1/k_1)·Σ_{i=1}^{k_1} H_t^i;
wherein H_unif is the humidity uniformity deviation, H_t^i represents the humidity value detected by the i-th humidity sensor at time step t, H̄_t represents the mean value of the humidity values detected by the humidity sensors at time step t, and k_1 is the number of the humidity sensors;
in calculating the temperature accuracy deviation between the actual temperature value of each temperature sensor and the set temperature and the temperature uniformity deviation between the actual temperature values, the calculation formula of the temperature accuracy deviation is as follows:
T_sc = −(1/k_2)·Σ_{i=1}^{k_2} |T_t^i − T_set|;
wherein T_sc represents the temperature precision deviation, T_t^i represents the temperature value detected by the i-th temperature sensor at time step t, T_set represents the set temperature, and k_2 is the number of the temperature sensors;
the calculation formula of the temperature uniformity deviation is as follows:
T_unif = −(1/k_2)·Σ_{i=1}^{k_2} |T_t^i − T̄_t|;
T̄_t = (1/k_2)·Σ_{i=1}^{k_2} T_t^i;
wherein T_unif represents the temperature uniformity deviation, T_t^i represents the temperature value detected by the i-th temperature sensor at time step t, T̄_t represents the mean value of the temperature values detected by the temperature sensors at time step t, and k_2 is the number of the temperature sensors.
3. The method for controlling the temperature and humidity of an indoor space according to claim 1, wherein the discount factor is 0.9-0.95.
4. The indoor space temperature and humidity regulation and control method according to claim 1, wherein the set strategy is an ε-greedy strategy: actions are randomly selected with probability ε, and actions are selected by the neural network with probability 1−ε.
5. The method for regulating and controlling the temperature and the humidity of the indoor space according to claim 1, wherein the learning rate of the preset deep reinforcement learning model is 0.00005-0.0001.
6. An indoor space temperature and humidity conditioning system, the system comprising:
a plurality of sensors including a humidity sensor and a temperature sensor, the sensors being disposed in the set space;
the constant temperature and humidity machine is provided with a plurality of air outlets, each air outlet is provided with a plurality of air speed gears, the air speed gears of the air outlets are independently arranged, and the air outlets are arranged in a set space;
the controller is connected with the sensors and used for acquiring a humidity value and a temperature value, and the controller is also connected with a constant temperature and humidity machine and used for controlling the temperature and the humidity in the set space according to the indoor space temperature and humidity regulation and control method of any one of claims 1 to 5.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 5 are implemented when the processor executes the program.
CN202111293890.6A 2021-11-03 2021-11-03 Indoor space temperature and humidity regulation and control method and device Active CN114020079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111293890.6A CN114020079B (en) 2021-11-03 2021-11-03 Indoor space temperature and humidity regulation and control method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111293890.6A CN114020079B (en) 2021-11-03 2021-11-03 Indoor space temperature and humidity regulation and control method and device

Publications (2)

Publication Number Publication Date
CN114020079A CN114020079A (en) 2022-02-08
CN114020079B true CN114020079B (en) 2022-09-16

Family

ID=80060131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111293890.6A Active CN114020079B (en) 2021-11-03 2021-11-03 Indoor space temperature and humidity regulation and control method and device

Country Status (1)

Country Link
CN (1) CN114020079B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114355767B (en) * 2022-03-21 2022-06-24 青岛理工大学 Q learning-based model-free control method for indoor thermal environment of endowment building
CN115421542B (en) * 2022-10-24 2023-01-24 广东电网有限责任公司佛山供电局 Automatic insect-proof and moisture-proof online monitoring method, system and equipment for outdoor equipment box
CN116827515A (en) * 2023-06-28 2023-09-29 苏州中析生物信息有限公司 Fog computing system performance optimization algorithm based on blockchain and reinforcement learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1462915A (en) * 2002-05-21 2003-12-24 株式会社山武 Control method and controller
CN110134165A (en) * 2019-05-13 2019-08-16 北京鹏通高科科技有限公司 A kind of intensified learning method and system for environmental monitoring and control
CN110458443A (en) * 2019-08-07 2019-11-15 南京邮电大学 A kind of wisdom home energy management method and system based on deeply study

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117521725A (en) * 2016-11-04 2024-02-06 渊慧科技有限公司 Reinforced learning system
WO2018224695A1 (en) * 2017-06-09 2018-12-13 Deepmind Technologies Limited Training action selection neural networks
KR101988504B1 (en) * 2019-02-28 2019-10-01 아이덴티파이 주식회사 Method for reinforcement learning using virtual environment generated by deep learning
CN111601490B (en) * 2020-05-26 2022-08-02 内蒙古工业大学 Reinforced learning control method for data center active ventilation floor
CN112963946B (en) * 2021-02-26 2022-06-17 南京邮电大学 Heating, ventilating and air conditioning system control method and device for shared office area
CN113283156B (en) * 2021-03-29 2023-09-15 北京建筑大学 Energy-saving control method for subway station air conditioning system based on deep reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1462915A (en) * 2002-05-21 2003-12-24 株式会社山武 Control method and controller
CN110134165A (en) * 2019-05-13 2019-08-16 北京鹏通高科科技有限公司 A kind of intensified learning method and system for environmental monitoring and control
CN110458443A (en) * 2019-08-07 2019-11-15 南京邮电大学 A kind of wisdom home energy management method and system based on deeply study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SAC reinforcement learning algorithm based on prioritized experience replay; Liu Qingqiang et al.; Journal of Jilin University; 2021-03-31; pp. 192-199 *
Research on intelligent optimization control of thermal comfort and energy saving in building HVAC; Huangfu Qianwen; China Masters' and Doctoral Dissertations Full-text Database; 2021-06-16; pp. 39-58 *

Also Published As

Publication number Publication date
CN114020079A (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN114020079B (en) Indoor space temperature and humidity regulation and control method and device
Yang et al. Model predictive control with adaptive machine-learning-based model for building energy efficiency and comfort optimization
CN111351180B (en) System and method for realizing energy conservation and temperature control of data center by applying artificial intelligence
JP7016407B2 (en) Energy optimization of the cooling unit through smart supply air temperature setpoint control
KR20190096311A (en) A device for generating a temperature prediction model and a method for providing a simulation environment
KR102267180B1 (en) Air conditioning system and controlling method thereof
Liang et al. Design of intelligent comfort control system with human learning and minimum power control strategies
CN111795484A (en) Intelligent air conditioner control method and system
CN111126605A (en) Data center machine room control method and device based on reinforcement learning algorithm
CN110986249B (en) Self-adjustment control method and system of air conditioner and air conditioner
CN110134165A (en) A kind of intensified learning method and system for environmental monitoring and control
Li et al. Toward intelligent multizone thermal control with multiagent deep reinforcement learning
WO2021192279A1 (en) Learning device and inference device for air-conditioning control
CN116045443A (en) Building air conditioner energy consumption prediction method based on indoor temperature optimization control
Zhang et al. Diversity for transfer in learning-based control of buildings
Yang et al. A machine-learning-based event-triggered model predictive control for building energy management
CN116485044B (en) Intelligent operation optimization method for power grid interactive type efficient commercial building
CN115717758A (en) Indoor space temperature and humidity regulation and control method and system
CN114110824B (en) Intelligent control method and device for constant humidity machine
KR101657137B1 (en) Intelligent Pigsty Air Vent Method Of Control
Cui et al. An Online Reinforcement Learning Method for Multi-Zone Ventilation Control With Pre-Training
GS et al. Mitigating an adoption barrier of reinforcement learning-based control strategies in buildings
US11662696B2 (en) Automatic control artificial intelligence device and method for update control function
KR20200010972A (en) Automatic control artificial intelligence device and method for update control function
Zhang A Reinforcement Learning Approach for Whole Building Energy Model Assisted HVAC Supervisory Control

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant