WO2023075631A1 - Control system for heating, ventilation and air conditioning devices - Google Patents

Control system for heating, ventilation and air conditioning devices

Info

Publication number
WO2023075631A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
training
hvac
child
virtual
Prior art date
Application number
PCT/RU2021/000472
Other languages
English (en)
Russian (ru)
Inventor
Александр Юрьевич БЕЛЯЕВ
Максим Анатольевич ЗУБОВ
Сергей Евгеньевич Шалунов
Владимир Анатольевич ПУШМИН
Original Assignee
Ооо (Общество С Ограниченной Ответственностью) "Арлойд Аутомейшн"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ооо (Общество С Ограниченной Ответственностью) "Арлойд Аутомейшн"
Priority to PCT/RU2021/000472
Priority to PCT/GB2022/051855
Publication of WO2023075631A1


Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B15/00 - Systems controlled by a computer
    • G05B15/02 - Systems controlled by a computer electric
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G05B13/027 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 - Program-control systems
    • G05B2219/20 - Pc systems
    • G05B2219/26 - Pc applications
    • G05B2219/2614 - HVAC, heating, ventilation, climate control
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 - Program-control systems
    • G05B2219/20 - Pc systems
    • G05B2219/26 - Pc applications
    • G05B2219/2642 - Domotique, domestic, home control, automation, smart house

Definitions

  • the present invention relates to the field of computer technology for controlling at least one heating, ventilation and air conditioning (HVAC) device, as well as to training a learning system to operate the HVAC device.
  • Neural networks are machine learning models that use one or more layers of blocks called neurons to non-linearly transform data and predict outputs for given inputs. Some of these networks are deep networks with one or more hidden layers in addition to the output layer. The output of each layer is used as input to the next layer of the neural network, such as the next hidden layer or the output layer. Each layer generates its output from the received input data in accordance with the current values of its parameter set.
  • One example of such systems is the continuous control method using deep reinforcement learning described in RU 2686030 C1.
  • the known method includes obtaining a mini-batch of experience tuples and updating the current values of the parameters of the actor neural network, comprising, for each experience tuple in the mini-batch: processing the training observation and training action in the experience tuple using the critic neural network to determine the neural network output for the experience tuple, and determining the predictive neural network output for the experience tuple. The current values of the parameters of the critic neural network are updated using the errors between the predictive neural network outputs and the neural network outputs, and the current values of the parameters of the actor neural network are updated using the critic neural network.
  • the known invention has disadvantages.
  • A disadvantage of the known method is poor control accuracy of at least one HVAC unit when providing optimal heating, ventilation and air conditioning in a room.
  • In particular, the accuracy of maintaining a constant indoor microclimate is low.
  • the problem solved by the claimed invention is to eliminate at least one of the above drawbacks.
  • the technical result of the claimed invention is to improve the control accuracy of at least one heating, ventilation and air conditioning (HVAC) device to improve the accuracy of maintaining a constant internal microclimate and, as a result, reduce energy costs.
  • An additional technical result is to minimize the time and computing resources required to train the neural network of the control system with at least one HVAC device.
  • the target metric includes at least one of the following or combinations thereof: electricity and/or power consumption, cost of electricity and/or power, HVAC unit run time, number of changes in HVAC unit operation, and deviation from target conditions.
  • the learning system comprises an actor-critic system comprising an actor neural network and a critic neural network.
  • the learning system comprises a global actor-critic learning system and one or more child actor-critic learning systems, wherein: the global actor-critic system comprises a global actor neural network and a global critic neural network; and the one or more child actor-critic systems each comprise a child actor neural network and a child critic neural network.
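To make the actor/critic split concrete, the sketch below defines a minimal pair of networks in Python (PyTorch). It is an illustration only, not the patent's implementation; the class names, layer sizes, and the choice of a discrete action space are assumptions.

```python
# Minimal actor/critic sketch (illustrative only; names and sizes are assumptions).
import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    """Maps an observation of the (virtual) environment to scores over discrete actions."""
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Unnormalised scores (logits), e.g. over {AC on, AC off, radiator on, radiator off}.
        return self.net(obs)

class CriticNetwork(nn.Module):
    """Estimates the value (expected cumulative reward) of an observed state."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
```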
  • training the neural network of the control system includes: transferring gradients from at least one child learning system to the global learning system and updating parameters of the global learning system based on the gradients received from at least one child learning system; copying parameters from the global learning system to at least one child learning system; and repeating the steps of transmitting gradients and copying parameters, wherein the steps of transmitting gradients and copying parameters are repeated until each of the neural networks of the child system and the global system converge.
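One way the gradient-transfer and parameter-copy cycle described above can be implemented is sketched below in the style of asynchronous actor-critic training; compute_loss, global_net, global_opt and child_net are hypothetical placeholders rather than elements named in the patent.

```python
# Illustrative child-to-global synchronisation step (assumed A3C-style scheme).
import torch

def sync_step(global_net, global_opt, child_net, batch, compute_loss):
    # 1. Compute gradients on the child network using its own copy of the virtual environment.
    loss = compute_loss(child_net, batch)
    for p in child_net.parameters():
        p.grad = None
    loss.backward()

    # 2. Transfer the child's gradients to the global network and update its parameters.
    for gp, cp in zip(global_net.parameters(), child_net.parameters()):
        gp.grad = None if cp.grad is None else cp.grad.clone()
    global_opt.step()

    # 3. Copy the updated global parameters back into the child network.
    child_net.load_state_dict(global_net.state_dict())

# Steps 1-3 are repeated until the child and global networks converge.
```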
  • training the neural network of the control system includes determining the combination of at least two neural networks to provide a given mode of operation of at least one virtual HVAC device in accordance with the said virtual environment.
  • training the neural network of the control system includes training at least one child system depending on a separate copy of the virtual environment, where each of the copies of the virtual environment has different initialization properties, and the initialization properties include at least one of the following values or a combination of them: density of people in the room, different configurations of at least one virtual HVAC device, and/or different virtual room model configurations.
  • the baseline simulator is configured to provide a separate target metric for at least one child system, with separate target metrics for said child system being determined depending on the corresponding separate copy of the virtual environment.
  • the neural network of the control system is trained depending on a set of target conditions provided by the baseline simulator, where the set of target conditions includes one or more of the conditions: indoor temperature and/or outdoor temperature, indoor humidity and/or humidity outside the premises.
  • the neural network is trained depending on a set of target indicators and/or sets of target conditions provided by the baseline simulator, where each target metric is used by the learning system for a separate training step.
  • training the neural network of the control system includes creating a reinforcement learning agent configured to operate depending on the training system, wherein the agent is configured to interact with at least one virtual HVAC device in said virtual environment.
  • the baseline simulator is configured to provide a plurality of targets and/or target conditions depending on the maximum number of actions that said agent can take during a predetermined learning period.
  • the baseline simulator is configured to provide a target metric depending on an input dataset containing a time-limited data set, while the neural network of the control system is trained depending on the same time-limited data set.
  • the baseline simulator is configured to provide a target metric without using the neural network of the control system and/or to provide a target metric using one or more of the following methods: a decision tree method, a linear data transformation method, methods based on relationship analysis, and methods using gradient boosting.
  • the baseline simulator is configured to provide a target metric depending on one of the following: user input to control the operation of at least one HVAC device in said virtual environment and/or stored data related to the operation of the virtual HVAC devices.
  • the input dataset contains one or more of the following data: HVAC device description and/or configuration data in said virtual environment, building layout and/or area layout data, climate condition data.
  • training the neural network of the control system includes providing a plurality of said agents, where each of the agents is configured to perform a plurality of actions and/or communicate with at least one HVAC device.
  • training the neural network of the control system includes training a plurality of learning systems, where training of each learning system of the plurality of learning systems occurs in said environment for each virtual HVAC device and/or for each group of a plurality of groups of similar virtual HVAC devices.
  • training a control system neural network includes training a plurality of learning systems associated with a plurality of said agents, each learning system being configured to interact with a respective agent.
  • training the neural network of the control system includes: training the first child system of the first learning system and the first child system of the second learning system depending on the first copy of said virtual environment; training a second child system of the first learning system and a second child system of the second learning system depending on the second copy of said virtual environment; wherein the first child system of the first learning system and the first child system of the second learning system are configured to interact with the first copy of said virtual environment simultaneously and/or the first child system of the first learning system and the first child system of the second learning system are configured to interact with the first copy of said virtual environment according to a predetermined condition.
  • a method for controlling at least one heating, ventilation and air conditioning (HVAC) device, comprising the steps of: receiving an input data set; creating a virtual environment associated with one or more HVAC devices, including creating, based on the received input data set, at least one virtual model of the HVAC device and at least one virtual model of the room in which said at least one virtual HVAC device model is located; performing an operation mode simulation of said at least one virtual HVAC device in said virtual room model based on the received input data set; obtaining a target metric based on the simulation performed; performing training of the neural network of the control system of at least one HVAC device in accordance with the received target metric and in accordance with the input data set; generating, by the controller, control instructions; and transmitting said instructions to at least one HVAC device, where said control instructions are generated based on the received target metric.
  • a computer readable medium for storing a computer program product containing instructions that, when executed by a processor, cause the processor to perform a method for controlling at least one heating, ventilation, and air conditioning (HVAC) device.
  • a method for training a neural network of a control system for at least one heating, ventilation, and air conditioning (HVAC) device, comprising the steps of: receiving an input data set; creating, by means of a virtual environment module, a virtual environment associated with one or more HVAC devices, including creating, based on the received input data set, at least one virtual model of the HVAC device and at least one virtual model of the room in which said at least one virtual HVAC device model is located; executing, with the baseline simulator, an operation mode simulation of said at least one virtual HVAC device in said virtual room model based on the received input data set; obtaining a target metric based on the simulation performed; and performing training of the neural network of the control system in accordance with the obtained target metric and in accordance with the input data set.
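The claimed training method can be read as the following high-level pipeline. The sketch below is only a schematic restatement in Python; every function and class name (create_virtual_environment, BaselineSimulator, train_control_network) is a hypothetical placeholder.

```python
# Schematic restatement of the claimed training steps (placeholder names throughout).
def train_hvac_control_network(input_data):
    # Step 1: receive the input data set (device configs 102, building 103, weather 104).
    # Step 2: create the virtual environment: HVAC device models inside a room model.
    env = create_virtual_environment(input_data)

    # Step 3: run the baseline simulator over the same input data.
    simulator = BaselineSimulator(input_data)
    baseline = simulator.simulate(env)

    # Step 4: obtain the target metric from the simulation (e.g. baseline energy use).
    target_metric = baseline.metric("electricity_consumption")

    # Step 5: train the control-system neural network against the target metric.
    policy = train_control_network(env, target_metric, input_data)
    return policy
```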
  • FIG. 1 shows an exemplary view of a virtual environment containing an HVAC device.
  • FIG. 2 shows a system including a reinforcement learning system.
  • FIG. 3 shows a computer device on which the system shown in FIG. 2 may be implemented.
  • FIG. 4 shows a neural network of a control system for at least one HVAC unit.
  • FIG. 5 shows the "actor-critic" system.
  • FIG. 6 shows a detailed version of a reinforcement learning system for the system shown in FIG. 2.
  • FIG. 7 shows the interaction of child and global neural networks in accordance with FIG. 6.
  • FIG. 8 shows a method for updating the weights of the child neural networks and the global neural network in accordance with FIG. 6.
  • FIG. 9 shows interactions between agents, reinforcement learning systems, and virtual environment copies that may be present in the system of FIG. 2.
  • FIG. 10 shows a method for providing a result based on an action proposed by a reinforcement learning system.
  • FIG. 1 shows a room 10 which includes two radiators 12-1, 12-2 (special cases of HVAC devices) and an air conditioning unit 14 (special case of HVAC device).
  • Each of these HVAC units can be used to change indoor conditions.
  • For example, the radiators can be used to heat the space, while the air conditioner can be used to cool it.
  • a neural network for controlling an HVAC device is disclosed, as well as a method for training this neural network and a control system for at least one HVAC device containing such a neural network. It should be understood that the methods and systems disclosed herein may also be used in other situations (not related to HVAC).
  • the neural network may be present in the HVAC device itself.
  • an HVAC device may be present in the control system that controls the device.
  • a controller may be provided comprising one or more neural networks, this controller being able to control the HVAC system and perform training of said control system neural network, the HVAC system comprising a plurality of HVAC devices.
  • The neural network training system 110 of the control system with reinforcement (hereinafter, reinforcement learning system 110) is trained depending on: one or more sets of input data 102, 103, 104, the virtual environment module 100, and the baseline simulator 400.
  • the virtual environment module 100 may be implemented by a separate computing device in conjunction with software and configured to create a virtual environment associated with one or more HVAC devices, including creating, based on the received input data set, at least one virtual model of the HVAC device and at least one virtual model of the room in which said at least one virtual model of the HVAC device is located.
  • the input datasets comprise a device description and configuration dataset 102, a building description dataset 103, and a weather dataset 104. It should be kept in mind that, more generally, input datasets may contain a number of input data and that different datasets may be used in practice.
  • the set of device descriptions and configurations may, for example, refer to the available configurations of air conditioners, thermostats, humidifiers, heat exchangers, etc. More generally, one of the input datasets is typically a dataset that defines an action space and/or a set of available actions for an agent (e.g., turning an air conditioner on or off). Based on the reinforcement learning software system, suitable actions are then identified that the agent should take in practice (after the reinforcement learning system has been trained). For example, a set of weather conditions can be used as input to the reinforcement learning system, and agent 200 can turn an air conditioner on or off based on the output of the reinforcement learning system. In a basic example, the reinforcement learning system can identify that the air conditioner should be turned on when there are high-temperature weather conditions.
  • the agent means the performer of control actions. It has one or more service capabilities that together form a single, complex execution model, which may include access to external software (SW), users, communications, HVAC devices, etc.
  • an agent is a module for managing external devices.
  • the agent is implemented as firmware and may be implemented, for example, as an HVAC device control computer, an HVAC device control controller located in and/or outside the HVAC device, etc.
  • the input datasets may include building description 103 datasets.
  • the description of a building may indicate the dimensions, plans, and/or zonal subdivisions of the building.
  • Reinforcement learning system 110 is then trained to control the available devices to achieve the desired building conditions.
  • the input datasets may include a weather dataset 104 (which, more generally, may be an environmental dataset).
  • This data set can be used in conjunction with the building description to determine appropriate agent actions that are relevant to the operation of the HVAC units.
  • Changes in weather conditions (for example, changes in temperature) may require changes to the operation of the HVAC unit if the desired indoor conditions are to be maintained.
  • Each input data set 102, 103, 104 is configured to be passed to the virtual environment module 100.
  • the virtual environment module 100 is configured to send data to the reinforcement learning software system.
  • the reinforcement learning software system is designed to transmit data to the agent 200, which is configured to control the parameters of the environment (for example, to control one or more devices in the building).
  • the reinforcement learning software system is configured for learning depending on the configuration of the virtual environment module 100 and hence the input datasets.
  • the virtual environment module 100 contains information related to the environment.
  • the virtual environment module 100 may contain a database of HVAC devices along with their possible control signals and their location (what information can be obtained from the input datasets).
  • the agent 200 can interact with the virtual environment module 100 to manage the devices present in the virtual environment.
  • virtual environment module 100 may contain the dimensions of the room as well as information (eg, possible states) for each of the HVAC devices 12-1, 12-2, 14 in the room. Module 100 may further contain information about the configuration of the room, such as whether the room has furniture and/or whether the door is open, and information about activities in the room, such as whether the room is full or empty.
  • virtual environment usually refers to a larger virtual environment than a room (eg, a building or multiple buildings).
  • the baseline simulator 400 is configured to provide a simulation of the behavior of at least one virtual HVAC device in said virtual room model based on the received input data set.
  • the simulator 400 is defined using a method other than reinforcement learning, such as a statistical model.
  • the baseline simulator 400 is designed to receive input from at least one of the input datasets.
  • the simulator 400 is typically adapted to receive a HVAC device description and configuration data set 102 and/or a building description data set 103.
  • simulator 400 is configured to receive each of the input datasets to provide a basic output dataset (given those datasets).
  • the simulator 400 is configured to provide achievable outputs related to controlling the virtual environment using conventional modeling techniques.
  • simulator 400 may simulate control of devices in a virtual environment based on equations or based on conditional logic (eg, "if the temperature rises above 27°C, turn on the air conditioner").
  • simulator 400 can simulate output conditions (humidity and temperature) that can be achieved using known tools and/or existing models.
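A rule-based baseline of the kind mentioned above could look like the sketch below; the 27°C/19°C thresholds echo the example in the text, while the simple thermal drift and the energy figures are purely illustrative assumptions.

```python
# Toy conditional-logic baseline step (illustrative thresholds and dynamics).
def baseline_step(indoor_temp_c: float, outdoor_temp_c: float) -> dict:
    ac_on = indoor_temp_c > 27.0       # "if the temperature rises above 27°C, turn on the AC"
    heater_on = indoor_temp_c < 19.0   # assumed symmetric heating rule

    # Crude update: drift toward the outdoor temperature, plus device effects.
    next_temp = indoor_temp_c + 0.1 * (outdoor_temp_c - indoor_temp_c)
    if ac_on:
        next_temp -= 0.5
    if heater_on:
        next_temp += 0.5

    # Energy used by the baseline at this step; summing it yields a target metric.
    energy_kwh = 1.5 * ac_on + 2.0 * heater_on
    return {"indoor_temp_c": next_temp, "ac_on": ac_on,
            "heater_on": heater_on, "energy_kwh": energy_kwh}
```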
  • the simulator 400 is used to determine and provide the desired environmental conditions.
  • baseline simulator 400 may provide the desired temperature and/or humidity. These desired conditions may be based on one or more of: user input, historical data, and/or simulation (eg, simulation that determines achievable values given user input). A reinforcement learning system can then be set up to achieve these target conditions.
  • the simulator 400 is typically designed to provide a key metric related to providing certain conditions.
  • simulator 400 may provide power usage statistics, device usage statistics, and/or device wear statistics that are required to obtain a set of conditions.
  • This metric can be used as a goal by a reinforcement learning software system, where the reinforcement learning system is designed to provide the same or similar conditions with an improved metric (eg, lowering the cost of electricity).
  • the metric refers to one or more of the following: electricity and/or power consumption; the cost of electricity and/or power (electricity is usually cheaper at certain times of the day, so minimizing usage and minimizing costs may require different actions); operating time of the device; the number of times the operating mode of the device is changed (this may affect the life of the device); and deviation from desired conditions (eg, maximum deviation, average deviation, and/or sum of deviations).
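The listed components can be folded into a single scalar target metric, for example as a weighted sum. The field names and weights in the sketch below are assumptions, not values from the patent.

```python
# Illustrative combination of the metric components listed above (assumed weights).
from dataclasses import dataclass

@dataclass
class EpisodeStats:
    energy_kwh: float        # electricity/power consumption
    energy_cost: float       # cost of electricity (tariff-dependent)
    runtime_hours: float     # HVAC device operating time
    mode_switches: int       # number of changes of operating mode
    temp_deviation_c: float  # accumulated deviation from desired conditions

def target_metric(stats: EpisodeStats,
                  w_cost: float = 1.0, w_runtime: float = 0.1,
                  w_switch: float = 0.05, w_dev: float = 0.5) -> float:
    # Lower is better; the reinforcement learning system tries to beat the baseline value.
    return (w_cost * stats.energy_cost
            + w_runtime * stats.runtime_hours
            + w_switch * stats.mode_switches
            + w_dev * stats.temp_deviation_c)
```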
  • simulator 400 provides a variety of metrics.
  • the provided metric may refer to a plurality of constituent metrics.
  • simulator 400 may indicate (eg, based on user input) the order of precedence for the metrics.
  • simulator 400 may indicate that the reinforcement learning system should prioritize achieving a target condition set and then seek to minimize costs.
  • the simulator 400 may indicate that a small deviation from the target condition set is acceptable if the deviation would significantly reduce power consumption.
  • Performing training of the neural network of the control system in accordance with these stages and, ultimately, obtaining a target metric for further control of at least one HVAC device ensures high accuracy in maintaining a constant indoor microclimate and comfortable indoor climate conditions. This is because continuous training on data characterizing the individual implementations of both the HVAC equipment and the possible configurations of the premises ensures that comfortable climatic conditions are maintained in the premises.
  • simulator 400 is configured to use historical data. For example, the operation of an existing HVAC system for several months can be used to create a simulator 400. This historical data can be used to determine the conditions that are desirable for the inhabitants of this building, and to determine the operation of the devices necessary to achieve these conditions. This historical data can also be used to form a goal that the reinforcement learning system 110 must meet and/or beat.
  • the method for configuring the simulator 400 includes monitoring the environment and/or HVAC devices to determine baseline operation, preferably over a period of at least a week and/or at least two weeks and/or at least a month. The simulator 400 can then be configured to provide an indication of the operation of the existing system in the environment.
  • the target metric may depend on historical data and/or environmental monitoring. Similarly, a target metric can be based on a predicted improvement. For example, the user may set a condition that a 20% efficiency improvement should be achieved over current systems; the target metric can then be based on a 20% reduction in the historically determined energy consumption.
  • Creating a virtual environment may include one or more of the following steps:
  • i. user-entered data, for example, user-entered values for temperature, humidity, etc.
  • ii. building information 103 and device description and configuration 102
  • iii. information about weather conditions 104 during the period corresponding to the simulated period, for example, the month of August of any year; multiple time periods (over several years) can also be used to improve the reliability of the simulation.
  • simulator 400 uses models trained on available, eg, time-limited, datasets. Models can use decision tree based methods or gradient boosting methods, etc.
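As one possible realisation of such a model, the sketch below fits a gradient-boosting regressor (scikit-learn) to a small, time-limited historical dataset; the feature layout and the numbers are invented for illustration.

```python
# Illustrative gradient-boosting baseline model (invented toy data).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Features per hour: [outdoor temperature °C, hour of day, occupancy flag]; target: energy (kWh).
X = np.array([[30.0, 14, 1], [18.0, 3, 0], [25.0, 12, 1], [10.0, 22, 0], [28.0, 16, 1]])
y = np.array([2.4, 0.3, 1.8, 1.1, 2.1])

baseline_model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
baseline_model.fit(X, y)

# Predicted baseline consumption for a new hour; aggregating such predictions
# over the simulated period yields the target metric.
predicted_kwh = baseline_model.predict(np.array([[26.0, 15, 1]]))[0]
```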
  • the resulting simulation is subsequently passed as part or all of the data characterizing the state of the virtual environment to the reinforcement learning system 110 .
  • the simulator 400 may be given weather conditions for a historical month (e.g., September 2000) and may then provide simulation output regarding the actions required to maintain the desired conditions for that period of time given those weather conditions.
  • This output can contain performance metrics such as power consumption, and the reinforcement learning system can use these performance metrics as a target.
  • simulator 400 can provide a target performance metric and/or target condition set to reinforcement learning system 110.
  • Reinforcement learning system 110 includes at least one neural network that is trained using the same input data set and maps the input data from this data set to a set of actions that must be performed by the agent 200 to achieve or improve upon the target conditions and/or indicators (parameters).
  • a reinforcement learning system can be used to achieve desired/target building conditions given certain conditions and certain inputs (e.g., a reinforcement learning system can maintain a certain desired temperature range in a building).
  • the reinforcement learning system is usually able to achieve these conditions by outperforming the simulator 400 with respect to the target metric (for example, a reinforcement learning system can reduce power consumption).
  • the number of outputs of the simulator 400 generally corresponds to the number of steps taken to train the reinforcement learning system 110; this number of steps corresponds to the maximum number of actions that agent 200 can perform before the end of training.
  • the time interval chosen to train the reinforcement learning system 110 is one month, and the agent 200 is configured to perform an action (e.g., turn the HVAC devices on or off) once per day
  • the simulator 400 may be configured to provide a set of output conditions corresponding to the number of days in that month, such that for each step performed by the agent 200 during the time interval selected for training, there is a known set of target output conditions provided by the simulator 400.
  • the simulator 400 is typically designed to provide a metric for each step (for example, daily electricity costs); simulator 400 may additionally or alternatively provide a metric for the situation as a whole (e.g. total electricity costs).
  • the simulator 400 (and the reinforcement learning system 110) are typically trained using data over a significant period of time, such as several months. During this period, weather conditions and/or desired environmental conditions may change. For example, winter weather conditions may be colder than average, and the power output of simulator 400 may be correspondingly higher than average. By considering a long period of time, the reinforcement learning system can be trained to perform at a high level under a variety of conditions.
  • while the simulator 400 can perform complex simulations to obtain target conditions/metrics, these conditions/metrics can also be obtained in a simpler way, for example, by a simple linear or non-linear transformation of the physical parameters that were present before the simulation, as well as by methods using analysis of the relationships between these physical parameters.
  • An example of the simulation that can be provided by the baseline simulator 400 is the method described in the E+ (EnergyPlus) specification ("Technical Handbook, EnergyPlusTM Documentation Version 9.4.0 (September 2020); US Department of Energy: https://energyplus.net/sites/all/modules/custom/nrel_custom/pdfs/pdfs_v9.4.0/EngineeringReference.pdf").
  • This open-source software requires as input a set of historical weather data for the relevant period or periods of time. For example, if the simulation is to be run for September, data from September of the previous year or of several previous years is required. As output, the necessary information is provided on the use of the HVAC units to achieve the desired conditions.
  • the agent 200 is configured to interact with the virtual environment module 100 based on the output of the reinforcement learning system 110.
  • the agent 200 is configured to control devices related to the virtual environment to achieve the desired environmental conditions.
  • the agent 200 operates on the output of the reinforcement learning software system, where the reinforcement learning system receives a set of input values and then provides the agent 200 with a set of actions (e.g., outputs), where these actions are related to the operation of the HVAC devices in the environment.
  • the agent 200 then implements the actions.
  • a set of weather conditions can be used as input to a reinforcement learning system; then, the reinforcement learning software system determines (using a neural network, as described below) the appropriate actions to take to obtain the desired set of conditions given those weather conditions.
  • These appropriate actions are passed to the agent 200, which interacts with the HVAC devices in the environment to implement the actions (eg, to turn the air conditioner on or off).
  • Agent 200 may be combined with reinforcement learning system 110, where the reinforcement learning software system includes an agent and is designed to interact directly with HVAC devices in the environment.
  • a separate agent 200 and/or reinforcement learning software system may be provided for each HVAC device in the environment.
  • there may be a separate agent 200 and/or a reinforcement learning software system provided for each group of HVAC units in the environment; for example, an air conditioner group and a radiator group may use two separate agents.
  • Also disclosed is a reinforcement learning method in which the reinforcement learning software system is trained so that it is suitable for controlling the agent 200.
  • To interact with the virtual environment module 100, the agent 200 usually receives data indicative of the state of the environment and, in response to this data, takes an action from a set of actions; this action is determined by the reinforcement learning software system. Data characterizing the state of the environment can be called an observation or input. The goal of the system as a whole is to maximize the "reward" received for performing actions in the described environment, where the reward is related to achieving or exceeding one or more indicators (e.g., those provided by the simulator 400).
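The observation/action/reward cycle described above can be summarised by the loop below; env, policy and agent are hypothetical placeholders, and the reward is assumed to be the margin by which the achieved metric beats the simulator's target metric.

```python
# Sketch of the agent's interaction loop (placeholder objects; assumed reward shape).
def run_episode(env, policy, agent, target_metric, n_steps):
    obs = env.reset()                         # observation = data characterizing the state
    total_reward = 0.0
    for _ in range(n_steps):
        action = policy.select_action(obs)    # chosen by the reinforcement learning system
        agent.apply(action)                   # agent 200 executes it on the HVAC devices
        obs, achieved_metric = env.step(action)
        total_reward += target_metric - achieved_metric  # reward for beating the baseline
    return total_reward
```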
  • the present disclosure also relates to a method for controlling at least one HVAC device based on a trained reinforcement learning software system and/or agent 200.
  • the reinforcement learning software system selects the actions to be performed by the agent 200 by interacting with the virtual environment module 100. That is, the reinforcement learning system 110 receives data related to the environment and environmental conditions and selects an action from a set of actions to be performed by the agent 200. Typically, an action is selected from a continuous set of actions. Similarly, an action may be selected from a discrete set of actions (e.g., a limited number of possible actions).
  • each of the components of the described system is typically implemented using the computing device 1000.
  • Each of these components may be implemented using the same computing device, or the components may be implemented using multiple computing devices.
  • the computing device 1000 includes a processor in the form of a central processing unit (CPU) 1002, a communication interface 1004, a memory 1006, a storage 1008, a removable storage 1010, and a user interface 1012 connected to each other by a bus 1014.
  • the user interface 1012 includes a display 1016 and an input/output device, which in this embodiment comprises the keyboard 1018 and the mouse 1020.
  • the CPU 1002 executes instructions, including instructions stored in memory 1006, storage device 1008, and/or removable storage device 1010.
  • Communication interface 1004 is typically an Ethernet network adapter connecting bus 1014 to an Ethernet outlet.
  • the Ethernet connector is connected to the network.
  • the Ethernet jack is usually connected to the network via a wired connection, but the connection can alternatively be wireless. It should be understood that many other means of communication (eg, Bluetooth®, infrared, etc.) may be used.
  • Memory 1006 stores instructions and other information for use by CPU 1002.
  • Memory is the main memory of computing device 1000. It typically includes both random access memory (RAM) and read only memory (ROM).
  • the storage device 1008 may be implemented as a mass storage device for the computing device 1000. In various implementations, the storage device is a built-in storage device in the form of a hard disk, flash memory, or some other similar mass storage device or an array of such devices.
  • Removable storage device 1010 may be implemented as an auxiliary storage device for computing device 1000.
  • the removable storage device is a storage medium for a removable storage device, such as an optical disk, for example, a digital versatile disk (DVD), a portable flash drive or some other similar portable solid-state storage device (computer-readable medium), or a plurality of such devices.
  • the removable storage device is remote from the computing device 1000 and includes a network storage device or a cloud storage device.
  • Computer device 1000 may include one or more graphics processing units (GPUs), application specific integrated circuits (ASICs), and/or one or more field programmable gate arrays (FPGAs).
  • a computer program product includes instructions for performing aspects of the method(s) described below.
  • the computer program product is stored at various stages in any of the storage devices 1006, the storage device 1008, and the removable storage device 1010.
  • the storage of the computer program product is long-term, except when the instructions included in the computer program product are executed by the CPU 1002, in which case the instructions are sometimes temporarily stored in the CPU or memory.
  • the removable storage device is removable from the computing device 1000 such that the computer program product is stored separately from the computing device from time to time.
  • reinforcement learning system 110 is trained using a first computing device. Once reinforcement learning system 110 has been trained on this first computing device, it can be used to control a system (eg, a HVAC system). To this end, the trained reinforcement learning software system and/or agent 200 may be output and/or transferred to another computing device.
  • the computing device 1000 is configured to receive input either through a sensor or through a communication interface input 1004.
  • the input data may include one or more of the following: environmental conditions, weather conditions (for example, a weather forecast), desired conditions (for example, a desired temperature entered by the user), and/or environmental configuration (for example, an indication of whether each door of the building is open).
  • FIG. 4 shows an exemplary neural network 10 that may be part of a reinforcement learning system 110.
  • the neural network 10 in FIG. 4 is a deep neural network that includes an input layer 12, one or more hidden layers 14, and an output layer 16. It will be appreciated that the example in FIG. 4 is just a simple example and in practice the neural network may contain additional layers. Also, although this exemplary neural network is a deep neural network, containing a hidden layer 14, more generally, a neural network could simply contain any layer that maps an input layer to an output layer.
  • a neural network is based on parameters that map inputs (eg, observations) to outputs (eg, actions) to achieve a desired outcome for a given input. These parameters usually contain weights. To determine the parameters that ensure efficient operation, the neural network is trained using training sets of input data. In the present system, the inputs to the neural network may be referred to as observations, with each observation characterizing the state of the environment (eg, current conditions).
  • neural network training consists of several stages, in which the parameters of the neural network are updated after each stage of training based on the performance of the neural network at that stage. For example, if a parameter change is defined to improve the performance of the neural network during the first training step, a similar change can be implemented before the second training step.
  • training a neural network involves a series of steps to ensure that the parameters provide the desired output for a range of input datasets.
  • the neural network is trained based on the simulator 400, where the metric provided by the simulator 400 allows the neural network to be trained faster than using traditional training methods.
  • A reinforcement learning system 110 typically comprises an actor-critic (Actor-Critic) system 20.
  • the actor-critic system 20 comprises two separate neural networks, one of which is an actor neural network (Actor) 22 and the other a critic neural network (Critic) 24.
  • the actor neural network (Actor) 22 is designed to receive input from the virtual environment module 100 and process it into actions that the agent 200 can perform. For example, the actor neural network may receive an indication of high-temperature weather conditions as input from the environment and then determine that the agent 200 should turn on the air conditioning unit (note that the actor neural network usually maps input tuples to output tuples without knowing the meaning of these tuples, so the actor neural network does not know that this output refers to the switched-on air conditioner).
  • the critic neural network (Critic) 24 is designed to receive a state, or input, from the virtual environment module 100 and an action from the actor neural network 22 (Actor), and to determine a value related to this action and state. Therefore, the critic neural network determines the result of the action proposed by the actor neural network.
  • the simulator 400 may indicate that a temperature of 23°C is desired.
  • the virtual environment module 100 may then provide a state value that indicates that the temperature is 25°C for the first time.
  • neural network actor 22 may return an action to turn on the heat emitter.
  • The critic neural network 24 determines the result of this action; for example, this action is likely to lead to an increase in temperature (thus, the temperature at the second time may be 27°C).
  • The critic neural network 24 receives the environment state values for the first and second times and determines that the temperature has risen and that the proposed action had a negative effect.
  • the parameters of the neural network-actor 22 are changed so that in the same situation, a similar action is not offered in the future. Therefore, the parameters of the neural network-actor 22 (Actor) are changed based on the feedback from the neural network-critic 24.
  • the information provided to the actor neural network 22 (Actor) by the critic neural network 24 (Critic) contains a temporal difference (TD) error.
  • This TD error takes into account the passage of time, so that it can be taken into account, for example, that turning on a radiator may only have an effect after a certain waiting period.
  • the feedback from the critic neural network 24 may include an indication of an error between the condition reached and the desired condition, with the parameters of the actor neural network 22 (Actor) being updated to minimize this error.
  • Neural network-critic 24 provides a subjective assessment of the benefits of the current state of the neural network-actor 22 (Actor). To improve this score, especially in the early stages of training, the simulator 400 can be used to determine the appropriate parameters for the critic neural network 24. In particular, the reward used to train one of the neural networks can be based on the difference between the target metric and the metric, associated with the current parameters of this neural network. In practice, the action offered by the neural network actor 22 (Actor) is usually associated with a metric (eg, electricity usage). The modifications made to the parameters of the actor neural network 22 (and/or the critic neural network 24) during the training phase may depend on the difference between this metric and the target metric provided by the simulator 400.
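For reference, a one-step TD error and a metric-difference reward of the kind described above can be written as in the sketch below; the discount factor and variable names are assumptions.

```python
# Illustrative TD error and metric-based reward (assumed discounting and names).
def td_error(reward: float, value_s: float, value_s_next: float, gamma: float = 0.99) -> float:
    # Positive when the action turned out better than the critic expected.
    return reward + gamma * value_s_next - value_s

def metric_reward(target_metric_kwh: float, achieved_metric_kwh: float) -> float:
    # E.g. baseline (simulator 400) uses 20 kWh, current policy uses 100 kWh -> -80.
    return target_metric_kwh - achieved_metric_kwh
```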
  • the parameters of the neural network-critic 24 and thus the determined value may depend on the condition and/or metric received from the simulator 400.
  • the critic neural network 24 may identify that the proposed set of actions of the actor neural network 22 (Actor) corresponds to an electricity cost metric of 100 kWh; however, without the context of a certain situation, the critic neural network 24 cannot determine whether this value is good or bad, and the critic neural network 24 usually needs to learn whether a certain outcome is good or bad through numerous learning steps. Therefore, in the early stages of training, the critic neural network 24 can only provide limited feedback to the actor neural network 22 (Actor), and the parameters of the actor neural network 22 (Actor) can only be changed slowly.
  • With the simulator 400, the critic neural network can learn much faster. For example, if the simulator 400 achieved an electricity cost metric of 20 kWh for the same situation, the critic neural network 24 can quickly determine that the current parameters of the actor neural network 22 (Actor), which resulted in the electricity cost metric of 100 kWh, differ significantly and are not optimal.
  • the training time of the reinforcement learning system 110 can be reduced.
  • the simulator 400 can be used to quickly train the actor neural network 22 (Actor) and the critic neural network 24 to provide an agent 200 that achieves performance at least similar to the simulator 400. Then the neural network actor 22 (Actor) and neural network-critic 24 can continue training so that the reinforcement learning system begins to outperform simulator 400. Using simulator 400 reduces the total training time required for the reinforcement learning software system.
  • Actor-critic system training may include: a. receiving mini-batches of experience tuples from the virtual environment, where each experience tuple contains a training observation characterizing the training state of the virtual environment, a training action from the possible action space for agent 200, a training reward received by agent 200 for performing the training action, and a next training observation characterizing the next training state of the virtual environment; b. updating the current values of the parameters of the actor neural network 22 using the received mini-batches of experience tuples, where for each experience tuple in the mini-batch an update typically comprises the steps described below; and c. creating a new experience tuple.
  • A new experience tuple is created.
  • This stage usually consists of the following steps: a. obtaining a new training observation; b. processing this new training observation using the actor neural network 22 (Actor) to select a new training action to be performed by the agent 200, in accordance with the current values of the parameters of the actor neural network 22; c. receiving a new training reward in response to agent 200 performing this new training action; d. obtaining a new next training observation; e. creating a new experience tuple that includes, as described above, the new training observation, the new training action, the new training reward, and the new next training observation.
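The experience tuple and the collection steps a-e above can be mirrored directly in code; the sketch below uses placeholder env and actor objects and is not taken from the patent.

```python
# Experience tuple and one collection step (placeholder env/actor objects).
from dataclasses import dataclass
from typing import Any

@dataclass
class ExperienceTuple:
    observation: Any        # training observation (state of the virtual environment)
    action: Any             # training action from the agent's action space
    reward: float           # training reward received for performing the action
    next_observation: Any   # next training observation

def collect_tuple(env, actor, observation) -> ExperienceTuple:
    action = actor.select_action(observation)        # step b: actor picks the action
    next_observation, reward = env.step(action)      # steps c-d: reward and next observation
    return ExperienceTuple(observation, action, reward, next_observation)  # step e
```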
  • An example of an existing system that uses the actor-critic scheme is the Asynchronous Advantage Actor-Critic (A3C) system (https://www.machineleamingmastery.ru/the-idea-behind-actor-critics-and-how-a2c-and-a3c-improve-them-6dd7dfd0acb8/).
  • the methods disclosed herein may be implemented using such a system.
  • the methods disclosed herein can equally be implemented using a Mixture-Density Recurrent Network (MDN-RNN).
  • reinforcement learning system 110 does not include an actor-critic system. Reinforcement learning system 110 will still include at least one neural network. In such systems, simulator 400 typically provides feedback on the reward model that is used to evaluate the performance of the neural network.
  • FIG. 6 shows an embodiment of reinforcement learning system 110 in more detail.
  • the reinforcement learning system 110 is designed to select actions (outputs) using the global neural network 300.
  • the global neural network is a neural network that is configured to accept a set of inputs and process the input to associate those inputs with an action (for example, if the temperature rises, turn on the air conditioner). Typically, this includes selecting a point in the continuous action space by the global neural network that determines the action to be performed by the agent 200 in response to the input.
  • This global neural network 300 is typically trained through reinforcement learning.
  • a global neural network can be trained via supervised or unsupervised learning (so that the disclosure of reinforcement learning system 110 is more generally the disclosure of a learning system that contains at least one neural network).
  • the global neural network 300 typically includes a global actor neural network 301, which provides a function for matching inputs to outputs (e.g., actions), and a global critic neural network 302, which is designed to receive actions and inputs (states) as input and to process these actions and inputs to create neural network outputs.
  • the reinforcement learning software system manages the parameters of the critic global neural network and the actor global neural network.
  • the reinforcement learning system comprises at least one child neural network 310-1, 310-2, 310-N; each child neural network contains a child neural network-actor 311-1, 311-2, 311-N and a child neural network-critic 312-1, 312-2, 312-N.
  • the child neural networks interact with the virtual environment module 100 at the same time (as shown in FIG. 6). This is achieved by providing respective copies of the virtual environment 101-1, 101-2, 101-N, with a separate copy of the virtual environment created and provided for each of the child neural networks.
  • Each copy of the virtual environment has different properties that can be set using different initialization conditions.
  • these properties refer to different configurations of the target real physical environment (eg, different building configurations).
  • each copy of the virtual environment may refer to a different population density or a different configuration of objects in the real physical environment (eg, different sets of doors that open or close).
  • Copies of the virtual environment may be initialized based on a number of parameters, each of which may be randomly selected or selected by the user. These parameters typically include one or more of the following: initial environmental conditions (eg initial temperature or humidity); initial state outside the environment; initial employment and/or occupancy; and characteristics of the devices (for example, the type of coolant used).
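Initialising the copies of the virtual environment with different properties might look like the sketch below; the parameter ranges and coolant names are illustrative assumptions.

```python
# Illustrative random initialisation of virtual-environment copies (assumed ranges).
import random
from dataclasses import dataclass

@dataclass
class EnvInit:
    indoor_temp_c: float    # initial environmental conditions
    outdoor_temp_c: float   # initial state outside the environment
    occupancy: int          # initial occupancy / density of people
    doors_open: bool        # configuration of objects in the environment
    refrigerant: str        # device characteristic (e.g. type of coolant)

def make_env_copies(n_children: int) -> list[EnvInit]:
    return [EnvInit(indoor_temp_c=random.uniform(15, 30),
                    outdoor_temp_c=random.uniform(-10, 35),
                    occupancy=random.randint(0, 50),
                    doors_open=random.random() < 0.5,
                    refrigerant=random.choice(["R32", "R410A"]))
            for _ in range(n_children)]
```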
  • each child neural network 310 is trained based on a different environment with different properties.
  • child neural networks may be structurally identical and/or may be initialized with the same parameters, and these parameters will differ due to the interaction of child neural networks with different versions of the environment. Similar child neural networks can be initialized with different parameters.
  • a plurality of similar copies of the virtual environment can also be provided, where these similar copies are associated with differently initialized child neural networks.
  • simulator 400 is configured to provide a metric and/or set of target conditions for each instance of virtual environment 101. For example, modeling a baseline for a crowded environment may result in higher energy consumption (and other measures of energy usage) than for a sparsely occupied environment.
  • the reinforcement learning software system is configured to use the child critic neural networks 312 to determine updates to the current parameters of the child actor neural networks 311 and to the weights of the child critic neural networks. After each step, the parameters of the child actor neural networks 311 and child critic neural networks 312 change according to the determined updates.
  • The child actor neural networks 311 and child critic neural networks 312 send, in accordance with process 35 depicted in FIG. 6 and FIG. 7, the results of their work (for example, gradients that provide an indication of the performance of these child neural networks 310) to the global neural network 300.
  • the global neural network then accumulates these training results and uses them to change the parameters of the global actor neural network 301 and the global critic neural network 302.
  • global neural network 300 is able to identify parameters and/or parameter modifications that result in improvements for various child neural networks 310 so that global neural network 300 parameters can be updated accordingly.
  • the training structure of each neural network is typically as follows: the global neural network 300 is trained by the child neural networks 310, i.e., the parameters of the global neural network are determined based on gradients and/or parameters received from the child neural networks.
  • the parameters of the global neural network 300 are determined depending on the average value of the gradients provided by the child neural networks 310. These gradients can be applied to the existing parameters of the global neural network (and, more specifically, to the existing parameters of the global neural network-actor 301 and the global neural network-critic 302) to get updated parameters.
  • the child neural networks 310 are configured to periodically receive parameter updates from the global neural network 300 when a condition is met (eg, when a certain number of training steps have been completed) and/or when a given set of actions has been performed.
  • a global neural network is designed to periodically send copies of its parameters to child neural networks, as shown in FIG. 7, where these parameters can replace the existing parameters of each child neural network.
  • Child neural networks 310 are trained based on different copies of the virtual environment 101 having different properties, so they are designed for different situations. For example, the first child neural network 310-1 may provide optimal performance when the environment is crowded (crowded room and/or rooms), the second child neural network 310-2 may provide optimal performance when the environment is sparsely populated (sparsely occupied premises and/or premises).
  • child neural networks 310 can train the global neural network to provide good performance in a number of types of environments (e.g., environments with a number of different properties). Periodically, the parameters of the global neural network are transmitted to the child neural networks; therefore, the child neural networks indirectly share their parameters with one another.
  • the first child neural network 310-1 can be trained based on the first copy of the virtual environment 101-1, which is characteristic of a crowded environment, while the second child neural network 310-2 can be trained based on the second copy of the virtual environment 101-2, which is characteristic of a sparsely populated environment.
  • the first child neural network can be trained indirectly for use in a sparsely populated environment by receiving parameters from a global neural network that has previously received gradients and/or parameters from the second child neural network.
  • the neural networks converge to provide a series of similar and/or identical child neural networks (and a global neural network that is the same) that provide optimal performance for a wide range of environments (sparsely populated environments, crowded environments, etc.).
  • Once the networks have converged, reinforcement learning system 110 can be used to provide suitable output for a given set of inputs. This convergence (or near convergence) can be indicated, for example, by displaying a notification to the user, or by providing output based on input to reinforcement learning system 110.
  • the frequency of data transmission (e.g., transmission of gradients/parameters from the child neural networks 310 to the global neural network 300 and transmission of parameters from the global neural network to the child neural networks) can be based on a number of steps, user input, or the rate of convergence.
  • the global neural network is designed to provide parameters to child neural networks no more than once every 20 steps, no more than once every 50 steps, and/or no more than once every 100 steps. Similarly, in various embodiments, the global neural network is designed to provide parameters to child neural networks at least once every 200 steps, at most once every 100 steps, and/or at least once every 50 steps.
  • As shown in FIG. 8, the reinforcement learning system 110 operates as follows:
  • each child neural network-actor 311 receives a set of tuples from the corresponding copy of the virtual environment 101 - in accordance with the process 31.
  • Each tuple contains data characterizing the state of the environment, an action from a given action space performed by agent 200 in response to the data, the reward for the action taken by agent 200, and the next set of input data that characterizes the next learning state of the environment.
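For illustration, the tuple described in the previous point can be represented as a small data structure; the field names below are assumptions, not identifiers used in the disclosure.

```python
# Illustrative sketch of one training tuple: state, action, reward, next state.
from dataclasses import dataclass
from typing import Sequence

@dataclass
class TransitionTuple:
    state: Sequence[float]        # observation of the environment state
    action: int                   # action chosen from the agent's action space
    reward: float                 # reward returned for the chosen action
    next_state: Sequence[float]   # observation of the next training state

example = TransitionTuple(state=[21.5, 0.45], action=1, reward=-0.2,
                          next_state=[21.9, 0.44])
```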
  • Each child actor neural network 311 then updates, in accordance with process 34, the current values of its parameters using the data obtained from the following sequence of actions: a) Each input and action in the resulting tuple is processed, in accordance with process 32, using each child critic neural network 312 to determine the output of the corresponding child actor neural network 311 for the tuple according to the current parameters of the child actor neural network 311 and the child critic neural network 312. Process 32 determines a predicted outcome or expected reward (e.g., expected performance) of the child neural network 310 for the training reward in the received tuple and the next set of training inputs in the same tuple.
  • the current parameter values of child actor neural network 311 and child critic neural network 312 are then updated, according to process 33, using an estimate of the benefit of the predicted outputs of child neural network 310 and reward from the environment, and based on the condition(s) and the metric(s) obtained from the simulator 400. For example, if the simulator 400 indicates that the child neural network is performing well below the maximum known possible performance, the parameters can be substantially changed; if the simulator 400 indicates that the child neural network is operating close to the maximum known possible performance, the parameters may only be changed by a small amount.
  • the reward is usually based on the difference between the output of the child actor neural network and the metric provided by the simulator; b) updating, in accordance with process 33, the current parameters of each child actor neural network 311 using the corresponding child critic neural network 312 and, for example, the entropy loss function.
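A hedged sketch of one such child update is given below in PyTorch, combining an advantage-style estimate, an entropy term and a reward shaped against a simulator metric; the network sizes, the shaping rule and the coefficient values are assumptions made for illustration, not the disclosed implementation.

```python
# Illustrative actor-critic update for a single child network (assumptions throughout).
import torch
import torch.nn as nn

state_dim, n_actions = 4, 3
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def update(state, action, env_reward, next_state, simulator_metric, gamma=0.99, beta=0.01):
    state, next_state = torch.tensor(state), torch.tensor(next_state)
    # Assumed shaping rule: reward measured relative to the baseline simulator metric.
    reward = env_reward - simulator_metric
    value = critic(state).squeeze()
    with torch.no_grad():
        target = reward + gamma * critic(next_state).squeeze()
    advantage = target - value
    logits = actor(state)
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum()                 # entropy term encourages exploration
    actor_loss = -log_probs[action] * advantage.detach()
    critic_loss = advantage.pow(2)
    loss = actor_loss + critic_loss - beta * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

update([21.0, 0.4, 0.0, 1.0], action=2, env_reward=-0.3,
       next_state=[21.4, 0.41, 0.0, 1.0], simulator_metric=-0.5)
```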
  • the current values of the parameters of the global actor neural network 301 and global critic neural network 302 are updated, in accordance with process 35, based on gradients and/or parameters sent by child neural networks.
  • Each child neural network may send these gradients/parameters at the same time, or the timing of the gradients/parameters may differ for different child neural networks.
  • the parameters of the child actor neural network 311 and the child critic neural network 312 are updated, in accordance with the process 36, using the parameters of the global neural network 300.
  • The update, in accordance with process 36, of the current parameters of each child critic neural network 312 is performed using the error between the reward given by the environment for performing the selected action and the expected-reward estimate generated by the child critic neural network 312 for the output of the child actor neural network 311 (for example, the difference between the desired and the achieved value).
  • the error between the output of the child actor neural network 311 for the current observation and the results of the simulator 400 may be used.
  • the system may determine an update to the current parameters that reduces the error, using conventional machine learning and optimization techniques, such as performing an iteration of gradient descent with backpropagation.
  • the reinforcement learning system 110, having processed the input data using the child actor neural network 311, selects the next action and updates these parameters using one or more of the following:
  • the gradient provided by the child critic neural network 312 (gradient 1) with respect to the next action taken in processing the associated inputs and outputs in accordance with the current parameters of the child critic neural network (eg, the improvement and/or reduction in performance that, as determined by the child critic neural network, would take place if the action were proposed based on the current parameters of the child actor neural network);
  • the reinforcement learning software system may compute gradient 1 and gradient 2 by backpropagating the respective gradients through the respective neural networks.
  • the reinforcement learning software system performs this process for each input tuple after each update of the parameters of the child critic 312 or the child actor 311 (eg, after each training step). Once the updates for each tuple have been calculated, the reinforcement learning software system updates the current parameters of each child actor neural network 311 and each child critic neural network 312 using the tuple updates. Thus, reinforcement learning system 110 iteratively improves the parameters of each of the neural networks.
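The per-tuple update described above can be sketched as gradient accumulation over a set of tuples followed by a single optimizer step; this is an illustrative choice under PyTorch, not a requirement of the disclosure.

```python
# Illustrative sketch: accumulate gradients across tuples, then apply one update.
import torch
import torch.nn as nn

net = nn.Linear(4, 1)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)
tuples = [(torch.randn(4), torch.randn(1)) for _ in range(8)]   # stand-in tuples

opt.zero_grad()
for x, target in tuples:
    loss = (net(x) - target).pow(2).mean() / len(tuples)
    loss.backward()          # gradients accumulate across the tuples
opt.step()                   # one parameter update using the combined gradients
```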
  • reinforcement learning system 110 updates, in accordance with process 36, the parameters of global actor neural network 301 and global critic neural network 302.
  • Synchronization of the parameters from the global neural network 300 occurs using a condition applicable to the child neural networks 310 (for example, this occurs when a certain number of actions have been performed by the child neural networks), and usually occurs by copying the parameters of the global neural network 300 after the stage of updating the parameters of the global neural network.
  • After processing a certain number of mini-batches of data, the global neural network 300 resets the gradients, and the process is repeated until agent 200 achieves the required decision quality. Thus, by iteratively updating the current values of the parameters, the reinforcement learning software system trains the global neural network 300 and the child neural networks 310 to generate neural network outputs that represent the cumulative time-amortised future rewards that can be obtained when agent 200 performs a given action in response to a given set of inputs.
  • the first child neural network 310-1 can interact with a copy of the virtual environment 101-1 that represents a crowded environment (a high-density room or rooms). After a series of training steps, the parameters of the first child neural network will reach values that ensure good performance of agent 200 for this type of environment.
  • the second child neural network 310-2 may interact with a copy of the virtual environment 101-2 that is sparsely populated (a low-density room or rooms) and may have parameters that make agent 200 work well for this type of environment.
  • the global neural network 300 receives gradients from each of these child neural networks and uses these gradients to form a neural network that performs well in each environment.
  • the parameters of this global neural network are periodically copied to the child neural networks.
  • the parameters may not perform optimally for one or more types of environments, so the child neural networks are retrained based on the appropriate copies of the environments (and then the parameters of the child neural networks are again provided to the global neural network). Over time, all neural networks converge into a neural network that provides optimal performance for each type of environment.
  • the first agent 200-1 may be configured to manage air conditioners, while the second agent 200-2 could be configured to control radiators (and each of these agents would have different possible actions and a different action space).
  • for each such agent, a separate global actor neural network, a global critic neural network, child actor neural networks and child critic neural networks can be created. This allows neural networks based on different action spaces to be trained quickly.
  • An illustration of the disclosed interaction scheme for a system containing multiple agents is shown in FIG. 9.
  • each of the agents 200-1, 200-2, 200-K interacts with a respective reinforcement learning system 110-1, 110-2, 110-K.
  • Each of these reinforcement learning systems contains many child neural networks.
  • Each of these reinforcement learning systems is designed to interact with a common set of copies of the virtual environment 101-1, 101-2, 101-N, where for each of the copies of the virtual environment, one child neural network from each reinforcement learning system interacts with this copy.
  • reinforcement learning systems can also be trained completely independently with completely separate copies of virtual environments.
  • backpropagation and parameter updates for these child neural networks and global neural networks can occur independently of each other.
  • each reinforcement learning system 110-1, 110-2, 110-K can be trained separately, and they can have neither common parameters nor common gradients.
  • copies of the virtual environments 101-1, 101-2, 101-N are required to take the actions from all agents 200 at the same time (in one tuple).
  • all agents perform actions and add their actions to the combined tuple, which is then passed to the environment to receive an updated state.
  • this may mean that for each copy of the virtual environment, the actions related to the first agent 200-1 are first performed (for example, turning air conditioning units on or off). Actions related to the second agent 200-2 are then performed (eg turning radiators on or off).
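A minimal sketch of gathering one action per agent into a combined tuple before stepping a shared environment copy is shown below; the agent and environment callables are placeholders, not the disclosed implementation.

```python
# Illustrative sketch: each agent contributes one action; the combined set is
# applied to the environment copy in a single step.
from typing import Callable, Dict

def step_with_all_agents(env_step: Callable[[Dict[str, int]], dict],
                         agents: Dict[str, Callable[[dict], int]],
                         observation: dict) -> dict:
    combined_actions = {name: policy(observation) for name, policy in agents.items()}
    return env_step(combined_actions)

# Toy usage: "ac" controls air conditioners, "rad" controls radiators.
agents = {"ac": lambda obs: int(obs["temp"] > 23.0),
          "rad": lambda obs: int(obs["temp"] < 19.0)}
env_step = lambda actions: {"temp": 21.0, "applied": actions}
print(step_with_all_agents(env_step, agents, {"temp": 24.0}))
```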
  • Agents 200 having an identical set of available actions are typically combined into a single agent having one set of global actor neural networks, global critic neural networks, child actor neural networks, and child critic neural networks. This avoids redundant training (which wastes available computing power). For example, there may be multiple agents, each controlling similar HVAC units. These agents may have the same action space, so they can be combined and trained at the same time. The action space may be determined based on the set 102 of configuration data and input device descriptions.
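As a sketch of the grouping described above, devices whose configuration data expose identical action sets can be bucketed so that each bucket is served by a single agent; the configuration format here is an assumption for illustration.

```python
# Illustrative sketch: derive action spaces from configuration data and group
# devices with identical action spaces under one agent.
from collections import defaultdict

devices = [
    {"id": "ac-1", "actions": ("off", "cool_low", "cool_high")},
    {"id": "ac-2", "actions": ("off", "cool_low", "cool_high")},
    {"id": "rad-1", "actions": ("off", "heat")},
]

groups = defaultdict(list)
for device in devices:
    groups[device["actions"]].append(device["id"])   # identical action spaces share one agent

for action_space, device_ids in groups.items():
    print(f"one agent for {device_ids} with action space {action_space}")
```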
  • child actor neural networks 311 and child critic neural networks 312 train a policy, calculate the effect of changing the parameters of these neural networks, and update the gradients in the global neural network 300 using not only the virtual environment and information about taken or potential actions, but also the results of the baseline simulator 400, in addition to the entropy estimation function, to improve the quality of training.
  • the key performance indicators or desired conditions provided by the simulator 400 are also used by the child critic neural network to evaluate the quality of the selected policy. For example, child critic neural networks can determine whether the desired conditions have been met and can then evaluate a performance metric to determine whether the neural network is performing well.
  • the simulator 400 provides a separate metric for each child network and/or for each copy of the virtual environment.
  • the output of simulator 400 typically contains a number of metrics along with rewards/priorities for those metrics.
  • the rewards accumulated by the simulator 400 are then compared with the rewards accumulated by each agent 200.
  • the performance of each agent 200 may then be compared to the simulator 400.
  • the metric provided by the simulator 400 may relate to the performance of the simulator 400 (e.g., how close the simulator 400 came to achieving a certain temperature and/or humidity, and how much power was required to achieve these conditions). This simulator 400 metric can then be compared with each agent to determine whether any of the agents outperformed the reference simulation.
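The baseline comparison can be illustrated with a trivial sketch; comparing summed rewards is an assumption made for clarity rather than the disclosed rule.

```python
# Illustrative sketch: compare an agent's accumulated reward with the baseline
# simulator's accumulated reward.
def outperforms_baseline(agent_rewards, simulator_rewards):
    return sum(agent_rewards) > sum(simulator_rewards)

simulator_rewards = [-1.0, -0.8, -0.9]   # e.g., per-step energy-use penalties
agent_rewards = [-0.9, -0.7, -0.8]
print(outperforms_baseline(agent_rewards, simulator_rewards))   # True
```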
  • the process of determining an update for the current parameter values of the global neural network 300 or child neural networks 310 may be performed by a system of one or more computers located in one or more locations.
  • any neural network may include one or more layers of batch normalization.
  • reinforcement learning system 110 is adapted to train global neural network 300 to have optimal parameters for controlling HVAC devices in the environment to achieve a desired set of conditions (which desired set of conditions can be based on user choice or neural network output).
  • the actions selected by the global neural network given a particular set of inputs are provided to the agent 200, which is adapted to interact with the environment (eg, HVAC devices in the environment) in accordance with the provided actions.
  • agent 200 can be adapted to modify the operation of radiators, air conditioners, dehumidifiers, and so on, to maintain the desired range of temperatures, humidity, etc.
  • FIG. 10 describes an embodiment in which a reinforcement learning system is used in conjunction with an agent 200 to control an HVAC device or HVAC system.
  • An HVAC system is a system containing a plurality of HVAC devices, which may include a central control system or a controller (eg, a computer device for controlling a plurality of computer devices).
  • the (trained) neural network takes as input information about a change in the environment.
  • This change may refer to, for example: a change in the weather, a change in population in the environment, and/or the opening and closing of doors.
  • the input signal can be obtained from the sensor of the HVAC unit.
  • a plurality of HVAC devices are connected to form an HVAC system that can be controlled by a controller.
  • the controller may then send input to each of the HVAC units, for example, a desired change in conditions or a change in environmental conditions may be sent to the HVAC units.
  • a controller can contain a neural network where the controller determines the appropriate action for each of the HVAC units and communicates these actions to the HVAC units.
  • the controller may comprise a plurality of neural networks that are associated with different HVAC devices and/or groups of HVAC devices.
  • a neural network may be present on the computing device of a single HVAC device.
  • the neural network provides (eg, recommends) an action based on the input.
  • the action is determined based on the parameters of the neural network, which have been determined using the learning method described above, so that the neural network provides the appropriate action for the given input.
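A minimal inference sketch for the deployed network is shown below; the input features, network shape and action list are illustrative assumptions, and in practice the trained parameters would be loaded rather than the untrained ones used here.

```python
# Illustrative sketch: the deployed network maps an observed change in
# conditions to a recommended action index.
import torch
import torch.nn as nn

actions = ["keep", "increase_cooling", "decrease_cooling"]
policy = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, len(actions)))

observation = torch.tensor([24.5, 0.55, 1.0])   # e.g., temperature, humidity, occupancy flag
with torch.no_grad():
    recommended = actions[int(policy(observation).argmax())]
print(recommended)
```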
  • the neural network implements the action via the agent 200.
  • the agent may change the operation of the HVAC device.
  • the HVAC device and/or HVAC system provides output related to the activities.
  • an HVAC device may display energy consumption to a user and/or provide an audible alert indicating a deviation from a set of target conditions.
  • each HVAC device and/or HVAC system is adapted to periodically output usage information related to the use of the HVAC devices in the environment.
  • using a reinforcement learning software system to control an HVAC device typically involves using one of the constituent neural networks (eg, global actor neural network 301) to control the HVAC device.
  • This neural network (eg, global actor neural network 301) can be trained using the methods disclosed herein, and then can be transmitted to an HVAC device or to a controller for an HVAC system. This may include transferring the parameters of the trained global actor neural network to the HVAC device or controller. This trained neural network can then control the HVAC device depending on the inputs received (eg, measured environmental changes and/or signals received from the HVAC control system).
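One plausible way to perform such a transfer is to serialise the trained parameters on the training device and reload them on the controller, sketched here with PyTorch; the network shape and the file name are hypothetical.

```python
# Illustrative sketch: save trained actor parameters, then load them on the controller.
import torch
import torch.nn as nn

trained_actor = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 4))
torch.save(trained_actor.state_dict(), "global_actor_params.pt")        # on the training device

controller_actor = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 4))
controller_actor.load_state_dict(torch.load("global_actor_params.pt"))  # on the HVAC controller
controller_actor.eval()
```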
  • the reinforcement learning software system as a whole is present only during training, with the global neural network 300 (and/or global actor neural network 301) being transferred from the training device to the HVAC device after the neural networks have converged and training of the reinforcement learning system has been completed.
  • training reinforcement learning system 110 may include one or more of the following: unsupervised deep learning and/or shallow learning, generative adversarial networks (GANs), variational autoencoders (VAEs). Such techniques may allow a model trained once for one HVAC device to be reused for similar HVAC devices with dissimilar action spaces. While the reinforcement learning software system typically uses reinforcement learning, other machine learning methods can be used instead of or in addition to reinforcement learning.
  • reinforcement learning system 110 may use any type of learning system, although the reinforcement learning system would normally still include a global learning system linked with many child learning systems.
  • the type of machine learning systems used may be determined based on the environment and/or input datasets.
  • the reward generation function used to train the actor-critic system is dynamic and/or selected from a variety of possible reward generation functions.
  • the reward generation function may depend on the rate of convergence and/or on differences in performance related to key metrics (for example, if the actor-critic system is close to optimal for the first key metric but performs poorly for the second key metric, the reward generation function can be modified to prioritize the second key metric).
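A toy sketch of switching between reward generation functions based on per-metric performance is given below; the threshold and weights are illustrative assumptions.

```python
# Illustrative sketch: pick a reward generation function that re-prioritises a
# lagging key metric.
def select_reward_fn(metric_1_score, metric_2_score, threshold=0.9):
    if metric_1_score >= threshold and metric_2_score < threshold:
        return lambda m1, m2: 0.2 * m1 + 0.8 * m2   # emphasise the second metric
    return lambda m1, m2: 0.5 * m1 + 0.5 * m2

reward_fn = select_reward_fn(metric_1_score=0.95, metric_2_score=0.60)
print(reward_fn(0.95, 0.60))
```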
  • training reinforcement learning system 110 may include hyperparameter auto-correction.
  • multiple agents are trained at the same time, and these agents have different reward generating functions. This may include training multiple agents for a single HVAC device or for a group of HVAC devices. These agents can then be activated and used depending on the desired set of conditions. For example, a first group of agents may be trained to perform optimally for a first set of target conditions, and a second group of agents may be trained to perform optimally for a second set of target conditions.
  • the environment is usually associated with the building.
  • the environment refers to a plurality of related buildings (eg, a plurality of neighboring buildings).
  • Reinforcement learning system 110 and agent 200 can then be trained to work together optimally (eg, one building may be affected by turning on the air conditioning for another building).
  • game theory is used to determine the rational allocation of resources.


Abstract

The invention relates to the field of techniques for controlling at least one heating, ventilation and air conditioning (HVAC) device, and for training a learning system to operate with an HVAC device. The technical result of the invention is an increase in control accuracy and a reduction in costs. This system for controlling a heating, ventilation and air conditioning (HVAC) device comprises: a neural network that can be trained via a learning system; a learning system for the control system's neural network; a baseline-indicator simulator; a virtual environment module; and a controller and a memory storing instructions for launching the training of the control system's neural network via the neural network learning system according to steps comprising, among others: obtaining an input data set; creating, via the virtual environment module, a virtual environment related to one or more HVAC devices on the basis of the obtained input data set; performing, by the baseline-indicator simulator, a simulation of the operating mode of the virtual HVAC device on the basis of the obtained input data set; obtaining target metrics on the basis of the simulation performed; and carrying out the training of the control system's neural network according to the obtained target metrics and according to the input data set.
PCT/RU2021/000472 2021-10-29 2021-10-29 Système de commande de dispositifs de chauffage, de ventilation et de climatisation d'air WO2023075631A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/RU2021/000472 WO2023075631A1 (fr) 2021-10-29 2021-10-29 Système de commande de dispositifs de chauffage, de ventilation et de climatisation d'air
PCT/GB2022/051855 WO2023073336A1 (fr) 2021-10-29 2022-07-18 Procédé, support lisible par machine et système de commande permettant de commander au moins un dispositif de chauffage, ventilation et climatisation (cvca)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2021/000472 WO2023075631A1 (fr) 2021-10-29 2021-10-29 Système de commande de dispositifs de chauffage, de ventilation et de climatisation d'air

Publications (1)

Publication Number Publication Date
WO2023075631A1 true WO2023075631A1 (fr) 2023-05-04

Family

ID=83457251

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/RU2021/000472 WO2023075631A1 (fr) 2021-10-29 2021-10-29 Système de commande de dispositifs de chauffage, de ventilation et de climatisation d'air
PCT/GB2022/051855 WO2023073336A1 (fr) 2021-10-29 2022-07-18 Procédé, support lisible par machine et système de commande permettant de commander au moins un dispositif de chauffage, ventilation et climatisation (cvca)

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/GB2022/051855 WO2023073336A1 (fr) 2021-10-29 2022-07-18 Procédé, support lisible par machine et système de commande permettant de commander au moins un dispositif de chauffage, ventilation et climatisation (cvca)

Country Status (1)

Country Link
WO (2) WO2023075631A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190163215A1 (en) * 2017-11-27 2019-05-30 General Electric Company Building energy modeling tool systems and methods
US20190354071A1 (en) * 2018-05-18 2019-11-21 Johnson Controls Technology Company Hvac control system with model driven deep learning
KR102170522B1 (ko) * 2018-10-18 2020-10-27 부산대학교 산학협력단 Exhibition hall energy management system taking into account the number of visitors and environmental changes
US20210132552A1 (en) * 2019-11-04 2021-05-06 Honeywell International Inc. Method and system for directly tuning pid parameters using a simplified actor-critic approach to reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6665651B2 (en) * 2001-07-18 2003-12-16 Colorado State University Research Foundation Control system and technique employing reinforcement learning having stability and learning phases
US7345691B2 (en) * 2004-12-02 2008-03-18 Winbond Electronics Corp. Method of image processing and electronic device utilizing the same
AU2016297852C1 (en) 2015-07-24 2019-12-05 Deepmind Technologies Limited Continuous control with deep reinforcement learning


Also Published As

Publication number Publication date
WO2023073336A1 (fr) 2023-05-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21962658

Country of ref document: EP

Kind code of ref document: A1