WO2023075631A1 - System for controlling heating, ventilation and air conditioning devices

Info

Publication number: WO2023075631A1
Application number: PCT/RU2021/000472
Authority: WO
Inventors: Александр Юрьевич БЕЛЯЕВ; Максим Анатольевич ЗУБОВ; Сергей Евгеньевич Шалунов; Владимир Анатольевич ПУШМИН
Original assignee: Ооо (Общество С Ограниченной Ответственностью) "Арлойд Аутомейшн"
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2023-05-04
Also published as: WO2023073336A1

Abstract

The invention relates to the field of technology for controlling at least one heating, ventilation and air conditioning (HVAC) device, as well as for training a trainable system to work with an HVAC device. The technical result of the invention is an improvement in control accuracy and a reduction in costs. A system for controlling a heating, ventilation and air conditioning (HVAC) device comprises: a neural network capable of being trained by a training system; a training system of the neural network of the control system; a reference parameter simulator; a virtual environment module; a controller; and a memory for storing instructions for prompting training of the neural network of the control system using the neural network training system according to steps that include, inter alia, acquiring an input data set; creating, by means of the virtual environment module and on the basis of the acquired input data set, a virtual environment pertaining to one or several HVAC devices; modelling, by means of the reference parameter simulator and on the basis of the acquired input data set, an operating mode of a virtual HVAC device; obtaining a target metric on the basis of said modelling; and training the neural network of the control system in accordance with the obtained target metric and in accordance with the input data set.

Description

CONTROL SYSTEM FOR HEATING, VENTILATION AND AIR CONDITIONING DEVICES

Technical field:

[0001] The present invention relates to the field of computer technology for controlling at least one heating, ventilation and air conditioning (HVAC) device, as well as for training a trainable system to work with the HVAC device.

State of the art:

[0002] Neural networks are machine learning models that use one or more layers of units, called neurons, which non-linearly transform data to predict outputs for given inputs. Some of these networks are deep networks with one or more hidden layers in addition to the output layer. The output of each hidden layer is used as input to the next layer of the neural network, that is, the next hidden layer or the output layer. Each layer generates its output from the received input in accordance with the current values of its parameter set.

[0003] There are many well-known methods for training agents that manage complex systems. These methods attempt to train a system that can choose optimal solutions from a fixed set of actions. Such systems usually require a large amount of training data and often cannot adapt to situations that were not included in the training dataset.

[0004] One example of such systems is the continuous control method using deep reinforcement learning described in RU 2686030 C1. The known method includes obtaining a mini-batch of experience tuples and updating the current values of the parameters of the actor neural network, which comprises, for each experience tuple in the mini-batch: processing the training observation and training action in the experience tuple using the critic neural network to determine the neural network output for the experience tuple, and determining the predictive neural network output for the experience tuple. The method further includes updating the current values of the parameters of the critic neural network using errors between the predictive neural network outputs and the neural network outputs, and updating the current values of the parameters of the actor neural network using the critic neural network.

[0005] However, the known method has disadvantages. When it is applied to HVAC device control, these may include poor control accuracy of at least one HVAC device in providing optimal heating, ventilation and air conditioning in the room. A further disadvantage is the low accuracy of maintaining a constant indoor microclimate. These shortcomings arise because the known method uses a predictive actor neural network and a predictive critic neural network to determine updates for the current parameter values of the actor neural network, which significantly increases computational complexity and, consequently, the time required for neural network training. The increase in computational complexity reduces the control accuracy of the HVAC device, since it leads to calculation error.

[0006] When training multiple agents, the computational complexity can grow exponentially, resulting in excessive training time. It is desirable to develop new neural network training methods that ensure that neural networks perform well while minimizing the time required to train such neural networks.

Disclosure of the invention:

[0007] The problem solved by the claimed invention is to eliminate at least one of the above drawbacks.

[0008] The technical result of the claimed invention is to improve the control accuracy of at least one heating, ventilation and air conditioning (HVAC) device to improve the accuracy of maintaining a constant internal microclimate and, as a result, reduce energy costs.

[0009] An additional technical result is to minimize the time and computing resources required to train the neural network of the control system with at least one HVAC device.

[0010] In the first possible embodiment of the present invention, a controller is provided

[0011] Additionally, the target metric includes at least one of the following or combinations thereof: electricity and/or power consumption, cost of electricity and/or power, HVAC unit run time, number of changes in HVAC unit operation, deviation from target conditions.

[0012] Additionally, the learning system comprises an actor-critic system comprising an actor neural network and a critic neural network.

[0013] Additionally, the learning system comprises a global actor-critic learning system and one or more child actor-critic learning systems, wherein: the global actor-critic system comprises a global actor neural network and a global critic neural network; and each of the one or more child actor-critic systems comprises a child actor neural network and a child critic neural network.

[0014] Further, training the neural network of the control system includes: transferring gradients from at least one child learning system to the global learning system and updating parameters of the global learning system based on the gradients received from at least one child learning system; copying parameters from the global learning system to at least one child learning system; and repeating the steps of transmitting gradients and copying parameters, wherein the steps of transmitting gradients and copying parameters are repeated until each of the neural networks of the child system and the global system converge.
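The exchange of gradients and parameters between the child systems and the global system can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: parameters are a flat list of floats rather than actor and critic networks, the child gradients are averaged before the global update, the gradient functions are toy quadratic losses standing in for training on separate environment copies, and convergence is declared when the global update becomes smaller than a tolerance. The names `train_global_and_children` and `quadratic_grad` are hypothetical.

```python
from typing import Callable, List, Tuple

Params = List[float]
GradFn = Callable[[Params], Params]


def train_global_and_children(
    global_params: Params,
    child_grad_fns: List[GradFn],
    lr: float = 0.1,
    tol: float = 1e-8,
    max_rounds: int = 10_000,
) -> Tuple[Params, List[Params], int]:
    """Repeat 'send gradients' and 'copy parameters' until the update is tiny."""
    # Each child starts from a copy of the global parameters.
    child_params = [list(global_params) for _ in child_grad_fns]
    for round_idx in range(1, max_rounds + 1):
        # Step 1: each child computes gradients on its own parameters
        # (in the patent's setting, on its own copy of the virtual environment).
        grads = [fn(p) for fn, p in zip(child_grad_fns, child_params)]
        n = len(grads)
        # Assumption: the global system averages the received gradients.
        mean_grad = [sum(g[i] for g in grads) / n for i in range(len(global_params))]
        new_global = [p - lr * g for p, g in zip(global_params, mean_grad)]
        step = max(abs(a - b) for a, b in zip(new_global, global_params))
        global_params = new_global
        # Step 2: copy the updated global parameters back to every child.
        child_params = [list(global_params) for _ in child_grad_fns]
        if step < tol:  # assumed convergence criterion
            return global_params, child_params, round_idx
    return global_params, child_params, max_rounds


def quadratic_grad(target: float) -> GradFn:
    """Toy gradient of sum((p - target)^2); a stand-in for a child's loss."""
    return lambda params: [2.0 * (p - target) for p in params]


final_global, final_children, rounds = train_global_and_children(
    [0.0], [quadratic_grad(1.0), quadratic_grad(3.0)]
)
```

With these two toy children the global parameter settles midway between the two targets, and after the final copy step every child holds the same parameters as the global system.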

[0015] Additionally, training the neural network of the control system includes determining the combination of at least two neural networks to provide a given mode of operation of at least one virtual HVAC device in accordance with the said virtual environment.

[0016] Additionally, training the neural network of the control system includes training at least one child system depending on a separate copy of the virtual environment, wherein each copy of the virtual environment has different initialization properties, and the initialization properties include at least one of the following or a combination thereof: density of people in the room, different configurations of at least one virtual HVAC device, and/or different virtual room model configurations.
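One way to picture the separate environment copies is as a set of distinct initialization records, one per child system. The sketch below is only an illustration: the `EnvInit` fields, their units, and the example values are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class EnvInit:
    """Initialization properties of one virtual environment copy (hypothetical fields)."""
    occupancy_per_m2: float   # density of people in the room
    device_config: str        # label of a virtual HVAC device configuration
    room_layout: str          # label of a virtual room model configuration


def make_env_copies(
    occupancies: Sequence[float],
    device_configs: Sequence[str],
    room_layouts: Sequence[str],
) -> List[EnvInit]:
    """Build one initialization record per combination of the given properties."""
    return [
        EnvInit(o, d, r)
        for o, d, r in product(occupancies, device_configs, room_layouts)
    ]


env_copies = make_env_copies(
    occupancies=[0.05, 0.2],
    device_configs=["two_radiators_one_ac"],
    room_layouts=["open_plan", "partitioned"],
)
```

Each record in `env_copies` would then seed the environment copy used by one child system, so that no two children train under identical starting conditions.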

[0017] Additionally, the baseline simulator is configured to provide a separate target metric for at least one child system, with separate target metrics for said child system being determined depending on the corresponding separate copy of the virtual environment.

[0018] Additionally, the neural network of the control system is trained depending on a set of target conditions provided by the baseline simulator, where the set of target conditions includes one or more of the following: indoor temperature and/or outdoor temperature, indoor humidity and/or outdoor humidity.

[0019] Additionally, the neural network is trained depending on a set of target indicators and/or sets of target conditions provided by the baseline simulator, where each target metric is used by the learning system for a separate training step.

[0020] Further, training the neural network of the control system includes creating a reinforcement learning agent configured to operate depending on the training system, wherein the agent is configured to interact with at least one virtual HVAC device in said virtual environment.

[0021] Additionally, the baseline simulator is configured to provide a plurality of targets and/or target conditions depending on the maximum number of actions that said agent can take during a predetermined learning period.

[0022] Additionally, the baseline simulator is configured to provide a target metric depending on an input dataset containing a time-limited data set, while the neural network of the control system is trained depending on the same time-limited data set.

[0023] Additionally, the baseline simulator is configured to provide a target metric without using the neural network of the control system and/or to provide a target metric using one or more of the following methods: a decision tree method, a linear data transformation method, methods based on relationship analysis, methods using gradient boosting.

[0024] Additionally, the baseline simulator is configured to provide a target metric depending on one of the following: user input to control the operation of at least one HVAC device in said virtual environment and/or stored data related to the operation of the virtual HVAC devices.

[0025] Additionally, the input dataset contains one or more of the following data: HVAC device description and/or configuration data in said virtual environment, building layout and/or area layout data, climate condition data.

[0026] Additionally, training the neural network of the control system includes providing a plurality of said agents, where each of the agents is configured to perform a plurality of actions and/or communicate with at least one HVAC device.

[0027] Additionally, training the neural network of the control system includes training a plurality of learning systems, where training of each learning system of the plurality of learning systems occurs in said environment for each virtual HVAC device and/or for each group of a plurality of groups of similar virtual HVAC devices.

[0028] Additionally, training the neural network of the control system includes training a plurality of learning systems associated with a plurality of said agents, each learning system being configured to interact with a respective agent.

[0029] Additionally, training the neural network of the control system includes: training the first child system of the first learning system and the first child system of the second learning system depending on the first copy of said virtual environment; training a second child system of the first learning system and a second child system of the second learning system depending on the second copy of said virtual environment; wherein the first child system of the first learning system and the first child system of the second learning system are configured to interact with the first copy of said virtual environment simultaneously and/or the first child system of the first learning system and the first child system of the second learning system are configured to interact with the first copy of said virtual environment according to a predetermined condition.

[0030] In a second possible embodiment of the present invention, there is provided a method for controlling at least one heating, ventilation and air conditioning (HVAC) device, comprising the steps of: receiving an input data set; creating a virtual environment associated with one or more HVAC devices, including creating, based on the received input data set, at least one virtual model of the HVAC device and at least one virtual model of the room in which said at least one virtual HVAC device model is located; performing an operation mode simulation of said at least one virtual HVAC device in said virtual room model based on the received input data set; obtaining a target metric based on the simulation performed; performing training of the neural network of the control system of at least one HVAC device in accordance with the obtained target metric and in accordance with the input data set; generating control instructions by the controller; and transmitting said instructions to at least one HVAC device, where said control instructions are generated based on the obtained target metric.

[0031] In a third exemplary embodiment of the present invention, a computer readable medium is provided for storing a computer program product containing instructions that, when executed by a processor, cause the processor to perform a method for controlling at least one heating, ventilation, and air conditioning (HVAC) device.

[0032] In a fourth possible embodiment of the present invention, a method is provided for training a neural network of a control system of at least one heating, ventilation, and air conditioning (HVAC) device, comprising the steps of: receiving an input data set; creating, by means of a virtual environment module, a virtual environment associated with one or more HVAC devices, including creating, based on the received input data set, at least one virtual model of the HVAC device and at least one virtual model of the room in which said at least one virtual HVAC device model is located; executing, with the baseline simulator, an operation mode simulation of said at least one virtual HVAC device in said virtual room model based on the received input data set; obtaining a target metric based on the simulation performed; and performing training of the neural network of the control system in accordance with the obtained target metric and in accordance with the input data set.

[0033] Obviously, both the foregoing general description and the following detailed description are for exemplary and explanatory purposes only and are not limitations of the present invention.

Brief description of drawings:

[0034] FIG. 1 shows an exemplary view of a virtual environment containing an HVAC device.

[0035] FIG. 2 shows a system including a reinforcement learning system.

[0036] FIG. 3 shows a computer device on which the system shown in FIG. 2 can be implemented.

[0037] FIG. 4 shows a neural network of a control system for at least one HVAC unit.

[0038] FIG. 5 shows the "actor-critic" ("performer-critic") system.

[0039] FIG. 6 shows a detailed version of a reinforcement learning system for the system shown in FIG. 2.

[0040] FIG. 7 shows the interaction of child and global neural networks in accordance with FIG. 6.

[0041] FIG. 8 shows a method for updating the weights of the child neural networks and the global neural network in accordance with FIG. 6.

[0042] FIG. 9 shows interactions between agents, reinforcement learning systems, and virtual environment copies that may be present in the system of FIG. 2.

[0043] FIG. 10 shows a method for providing a result based on an action proposed by a reinforcement learning system.

Implementation of the invention:

[0044] FIG. 1 shows an exemplary view of a virtual environment containing an HVAC device. In particular, FIG. 1 shows a room 10 which includes two radiators 12-1, 12-2 (special cases of HVAC devices) and an air conditioning unit 14 (a special case of an HVAC device). Each of these HVAC units can be used to change indoor conditions. For example, radiators can be used to heat a space, and an air conditioner can be used to cool a space.

[0045] The use of these devices requires certain costs, such as the cost of electricity and the cost of maintaining the device. Therefore, it is desirable that these devices operate efficiently to maintain the desired indoor conditions with accuracy corresponding to the desired or target indoor conditions.

[0046] In this case, a neural network for controlling an HVAC device is disclosed, as well as a method for training this neural network and a control system for at least one HVAC device containing such a neural network. It should be understood that the methods and systems disclosed herein may also be used in other situations (not related to HVAC).

[0047] The neural network may be present in the HVAC device itself. Similarly, an HVAC device may be present in the control system that controls the device. In particular, a controller may be provided comprising one or more neural networks, this controller being able to control the HVAC system and perform training of said control system neural network, the HVAC system comprising a plurality of HVAC devices.

[0048] Next, FIG. 2 shows a system in which the reinforcement-based training system 110 for the neural network of the control system (hereinafter reinforcement learning system 110) is trained depending on: one or more sets of input data 102, 103, 104, virtual environment module 100, and baseline simulator 400. The virtual environment module 100 may be implemented by a separate computing device in conjunction with software and configured to create a virtual environment associated with one or more HVAC devices, including creating, based on the received input data set, at least one virtual model of the HVAC device, and at least one virtual model of the room in which said at least one virtual model of the HVAC device is located.

[0049] In the example shown in FIG. 2, the input datasets comprise a device description and configuration dataset 102, a building description dataset 103, and a weather dataset 104. It should be kept in mind that, more generally, input datasets may contain a number of input data and that different datasets may be used in practice.

[0050] The exemplary data sets in FIG. 2 are particularly useful where the learning system is used to control a heating, ventilation and air conditioning (HVAC) system. It should be understood that other datasets may be used, particularly when the training system is applied to other systems.

[0051] The set of device descriptions and configurations may, for example, refer to the available configurations of air conditioners, thermostats, humidifiers, heat exchangers, etc. More generally, one of the input datasets is typically a dataset that defines an action space and/or a set of available actions for an agent (e.g., turning an air conditioner on or off). The reinforcement learning system 110 then identifies suitable actions that the agent should take in practice (after the reinforcement learning system has been trained). For example, a set of weather conditions can be used as input to a reinforcement learning system, and agent 200 can turn an air conditioner on or off based on the output of the reinforcement learning system. In the basic example, the reinforcement learning system can identify that the air conditioner should be turned on when the weather conditions include high temperatures. In the context of the present invention, the agent means the performer of control actions. It has one or more service capabilities that form a single, complex execution model, which may include access to external software, users, communications, HVAC devices, etc. In other words, an agent is a module for managing external devices. The agent is implemented as firmware and may be implemented, for example, as an HVAC device control computer, an HVAC device controller located in and/or outside the HVAC device, etc.

[0052] The input datasets may include a building description dataset 103. The description of a building may indicate the dimensions, plans, and/or zonal subdivisions of the building. Reinforcement learning system 110 is then trained to control the available devices to achieve the desired building conditions.

[0053] The input datasets may include a weather dataset 104 (which, more generally, may be an environmental dataset). This data set can be used in conjunction with the building description to determine appropriate agent actions that are relevant to the operation of the HVAC units. In particular, changes in weather conditions (for example, changes in temperature) may require changes to the operation of the HVAC unit if the desired indoor conditions are to be maintained.

[0054] Each input data set 102, 103, 104 is configured to be passed to the virtual environment module 100. The virtual environment module 100 is configured to send data to the reinforcement learning system 110. In turn, the reinforcement learning system 110 is designed to transmit data to the agent 200, which is configured to control the parameters of the environment (for example, to control one or more devices in the building). Thus, the reinforcement learning system 110 is configured for learning depending on the configuration of the virtual environment module 100 and hence the input datasets.

[0055] The virtual environment module 100 contains information related to the environment. For example, the virtual environment module 100 may contain a database of HVAC devices along with their possible control signals and their locations (information that can be obtained from the input datasets). Thus, the agent 200 can interact with the virtual environment module 100 to manage the devices present in the virtual environment.

[0056] In the exemplary virtual environment of room 10 in FIG. 1, virtual environment module 100 may contain the dimensions of the room as well as information (e.g., possible states) for each of the HVAC devices 12-1, 12-2, 14 in the room. Module 100 may further contain information about the configuration of the room, such as whether the room has furniture and/or whether the door is open, and information about activities in the room, such as whether the room is full or empty. The virtual environment usually covers a larger space than a single room (e.g., a building or multiple buildings).

[0057] The baseline simulator 400 is configured to provide a simulation of the behavior of at least one virtual HVAC device in said virtual room model based on the received input data set. Typically, the simulator 400 is defined using a method other than reinforcement learning, such as a statistical model.

[0058] Also, the baseline simulator 400 is designed to receive input from at least one of the input datasets. In particular, the simulator 400 is typically adapted to receive an HVAC device description and configuration data set 102 and/or a building description data set 103. Typically, simulator 400 is configured to receive each of the input datasets to provide a basic output dataset (given those datasets).

[0059] Typically, the simulator 400 is configured to provide achievable outputs related to controlling the virtual environment using conventional modeling techniques. For example, simulator 400 may simulate control of devices in a virtual environment based on equations or based on conditional logic (e.g., "if the temperature rises above 27°C, turn on the air conditioner"). Thus, for any input data set, simulator 400 can simulate output conditions (humidity and temperature) that can be achieved using known tools and/or existing models.
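The conditional-logic style of baseline simulation mentioned above can be sketched as follows. The 27°C threshold comes from the example in the text; the rated power `AC_POWER_KW`, the one-hour step length, and the function name `simulate_rule_based` are assumptions introduced only for illustration.

```python
from typing import List, Tuple

AC_ON_THRESHOLD_C = 27.0  # threshold from the example rule in the text
AC_POWER_KW = 2.0         # hypothetical rated power of the air conditioner


def simulate_rule_based(
    indoor_temps_c: List[float], hours_per_step: float = 1.0
) -> Tuple[List[bool], float]:
    """Apply 'above 27°C -> air conditioner on' at every step.

    Returns the on/off decision per step and the resulting energy use in kWh,
    which can serve as a baseline metric.
    """
    actions = [temp > AC_ON_THRESHOLD_C for temp in indoor_temps_c]
    energy_kwh = sum(AC_POWER_KW * hours_per_step for on in actions if on)
    return actions, energy_kwh


baseline_actions, baseline_energy_kwh = simulate_rule_based([25.0, 27.0, 28.5, 30.0])
```

For the four sample temperatures, the rule switches the air conditioner on only for the last two steps, so the baseline energy use is two hours at the assumed rated power.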

[0060] In some embodiments, the simulator 400 is used to determine and provide the desired environmental conditions. For example, baseline simulator 400 may provide the desired temperature and/or humidity. These desired conditions may be based on one or more of: user input, historical data, and/or simulation (e.g., simulation that determines achievable values given user input). A reinforcement learning system can then be set up to achieve these target conditions.

[0061] Additionally or alternatively, the simulator 400 is typically designed to provide a key metric related to providing certain conditions. For example, simulator 400 may provide power usage statistics, device usage statistics, and/or device wear statistics that are required to obtain a set of conditions. This metric can be used as a goal by the reinforcement learning system 110, where the reinforcement learning system is designed to provide the same or similar conditions with an improved metric (e.g., a lower cost of electricity).

[0062] In various embodiments, the metric refers to one or more of the following: electricity and/or power consumption; the cost of electricity and/or power (electricity is usually cheaper at certain times of the day, so minimizing usage and minimizing costs may require different actions); operating time of the device; the number of times the operating mode of the device is changed (this may affect the life of the device); and deviation from desired conditions (e.g., maximum deviation, average deviation, and/or sum of deviations).
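The three deviation measures named above (maximum, average, and sum) can be computed from a series of measured values and a target value. The sketch below assumes a single scalar target and absolute deviations; the function name and the sample temperatures are hypothetical.

```python
from typing import Dict, List


def deviation_metrics(measured: List[float], target: float) -> Dict[str, float]:
    """Return the maximum, mean, and summed absolute deviation from the target."""
    deviations = [abs(value - target) for value in measured]
    return {
        "max": max(deviations),
        "mean": sum(deviations) / len(deviations),
        "sum": sum(deviations),
    }


# Four temperature readings in °C against a 22.0°C target.
metrics = deviation_metrics([21.0, 22.5, 23.0, 22.0], target=22.0)
```

Here the individual deviations are 1.0, 0.5, 1.0 and 0.0, which gives a maximum of 1.0, a sum of 2.5 and a mean of 0.625.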

[0063] In some embodiments, simulator 400 provides a variety of metrics. In some embodiments, the provided metric may refer to a plurality of constituent metrics. In each of these situations, simulator 400 may indicate (e.g., based on user input) the order of precedence for the metrics. As an example, simulator 400 may indicate that the reinforcement learning system should prioritize achieving a target condition set and then seek to minimize costs. Alternatively, the simulator 400 may indicate that a small deviation from the target condition set is acceptable if the deviation would significantly reduce power consumption. Moreover, training the neural network of the control system in accordance with the described steps and, ultimately, obtaining a target metric for further control of at least one HVAC device ensures high accuracy in maintaining a constant indoor microclimate and thus comfortable indoor climate conditions. This is because continuous training on data characterizing the individual implementations of both the HVAC equipment and the possible premises ensures positive results in maintaining comfortable climatic conditions in the premises.
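The order of precedence between metrics described in this paragraph can be sketched as a lexicographic comparison: deviation from the target conditions is compared first, and energy use breaks ties. A tolerance models the alternative in which small deviations are acceptable. The `Outcome` type, the tolerance parameter, and the sample values are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Outcome:
    name: str
    max_deviation_c: float  # deviation from the target condition set
    energy_kwh: float       # energy used to reach those conditions


def rank_key(outcome: Outcome, tolerance_c: float) -> Tuple[float, float]:
    """Deviation beyond the tolerance ranks first; energy use ranks second."""
    excess = max(0.0, outcome.max_deviation_c - tolerance_c)
    return (excess, outcome.energy_kwh)


def best_outcome(outcomes: List[Outcome], tolerance_c: float = 0.0) -> Outcome:
    """Pick the outcome that is best under the assumed order of precedence."""
    return min(outcomes, key=lambda o: rank_key(o, tolerance_c))


candidates = [
    Outcome("strict", max_deviation_c=0.2, energy_kwh=120.0),
    Outcome("relaxed", max_deviation_c=0.8, energy_kwh=90.0),
]
strict_choice = best_outcome(candidates, tolerance_c=0.0)
tolerant_choice = best_outcome(candidates, tolerance_c=1.0)
```

With no tolerance the outcome with the smaller deviation wins despite its higher energy use; once deviations up to 1.0°C are treated as acceptable, both candidates tie on the first criterion and the lower-energy outcome is selected.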

[0064] In some embodiments, simulator 400 is configured to use historical data. For example, data on the operation of an existing HVAC system over several months can be used to create a simulator 400. This historical data can be used to determine the conditions that are desirable for the inhabitants of the building, and to determine the operation of the devices necessary to achieve these conditions. This historical data can also be used to form a goal that the reinforcement learning system 110 must meet and/or beat.

[0065] In some embodiments, the method for configuring the simulator 400 includes monitoring the environment and/or HVAC devices to determine baseline operation, preferably over a period of at least a week and/or at least two weeks and/or at least a month. The simulator 400 can then be configured to provide an indication of the operation of the existing system in the environment.

[0066] The target metric may depend on historical data and/or environmental monitoring. Similarly, a target metric can be based on a predicted improvement. For example, the user may specify that a 20% efficiency improvement over current systems should be achieved. The target metric can then be based on a 20% reduction in historically determined energy consumption.
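The arithmetic behind such a target is a simple scaling of the historical figure. In the sketch below, the function name and the historical consumption of 1500 kWh are hypothetical; only the 20% figure comes from the text.

```python
def target_from_history(historical_kwh: float, improvement_fraction: float) -> float:
    """Scale historical energy consumption down by the requested improvement."""
    if not 0.0 <= improvement_fraction < 1.0:
        raise ValueError("improvement_fraction must be in [0, 1)")
    return historical_kwh * (1.0 - improvement_fraction)


# A 20% reduction applied to an assumed historical consumption of 1500 kWh.
target_kwh = target_from_history(1500.0, 0.20)
```

For this assumed history, the target is 1200 kWh, which the trained system would be expected to meet or beat.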

[0067] Creating a virtual environment may include one or more of the following steps:

1. Creating and/or obtaining the physical parameters necessary for modeling, for example:
i. user-entered data (for example, user-entered values for temperature, humidity, etc.);
ii. building information 103 and device description and configuration 102;
iii. information about weather conditions 104 during the period corresponding to the simulated period, for example, the month of August of any year. Multiple time periods (over several years) can also be used to improve the reliability of the simulation.
2. Performing simulations for a given period of time by methods other than reinforcement learning. These methods may include, but are not limited to:
i. user data (for example, user-entered equations);
ii. decision tree methods;
iii. linear data transformation;
iv. methods based on relationship analysis (regression, etc.);
v. algorithms that use gradient boosting;
vi. deterministic models;
vii. stochastic models.

[0068] Typically, simulator 400 uses models trained on available, e.g., time-limited, datasets. Models can use decision tree based methods or gradient boosting methods, etc. The resulting simulation is subsequently passed as part or all of the data characterizing the state of the virtual environment to the reinforcement learning system 110. As an example, the simulator 400 may be given weather conditions for a historical month (e.g., September 2000) and may then provide simulation output regarding the actions required to maintain the desired conditions for that period of time given those weather conditions. This output can contain performance metrics such as power consumption, and the reinforcement learning system can use these performance metrics as a target.

[0069] Therefore, using the input dataset, simulator 400 can provide a target performance metric and/or target condition set to reinforcement learning system 110. Reinforcement learning system 110 includes at least one neural network that is trained using the same input data set and maps the input data from this data set to a set of actions that must be performed by the agent 200 to achieve or improve on the target conditions and/or indicators (parameters). Thus, a reinforcement learning system can be used to achieve desired/target building conditions given certain conditions and certain inputs (e.g., a reinforcement learning system can maintain a certain desired temperature range in a building). The reinforcement learning system is usually able to achieve these conditions while outperforming the simulator 400 with respect to the target metric (for example, a reinforcement learning system can reduce power consumption).

[0070] The number of outputs of the simulator 400 generally corresponds to the number of steps taken to train the reinforcement learning system 110; this number of steps corresponds to the maximum number of actions that agent 200 can perform before the end of training. Thus, if the time interval chosen to train the reinforcement learning system 110 is one month, and the agent 200 is configured to perform an action (e.g., turn the HVAC devices on or off) once per day, then the simulator 400 may be configured to provide a set of output conditions corresponding to the number of days in that month, such that at each step performed by the agent 200 during the time interval selected for training, there is a known set of target output conditions provided by the simulator 400. The simulator 400 is typically designed to provide a metric for each step (for example, daily electricity costs); simulator 400 may additionally or alternatively provide a metric for the situation as a whole (e.g., total electricity costs).
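The correspondence between training steps and simulator outputs can be sketched for the one-month, one-action-per-day case. The helper names and the flat daily cost of 4.0 currency units are hypothetical; the month September 2000 is the example period used earlier in the text.

```python
import calendar
from typing import Dict, List, Union


def steps_in_month(year: int, month: int, actions_per_day: int = 1) -> int:
    """Maximum number of agent actions in the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return days_in_month * actions_per_day


def baseline_targets(daily_costs: List[float]) -> Dict[str, Union[List[float], float]]:
    """Bundle a per-step metric with the metric for the period as a whole."""
    return {"per_step": list(daily_costs), "total": sum(daily_costs)}


n_steps = steps_in_month(2000, 9)   # one action per day in September 2000
daily_costs = [4.0] * n_steps       # hypothetical baseline cost for each day
targets = baseline_targets(daily_costs)
```

The simulator output then contains exactly one target per agent step, plus an aggregate figure that can be compared against the trained system's total over the same period.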

[0071] The simulator 400 (and the HO reinforcement learning system) are typically trained using data over a significant period of time, such as several months. During this period, weather conditions and/or desired environmental conditions may change. For example, winter weather conditions may be colder than average, and the power output of simulator 400 may be correspondingly higher than average. Considering a long period of time, a reinforcement learning system can be trained to perform at a high level in a variety of conditions.

[0072] Although the simulator 400 can perform complex simulations to obtain target conditions/metrics, these conditions/metrics can also be obtained in a simpler way, for example, by a simple linear or non-linear transformation of physical parameters that were available before the simulation, or by methods that analyze relationships between these physical parameters.

[0073] An example of the simulation that can be provided by the baseline simulator 400 is the method described in the E+ (EnergyPlus) specification ("Technical Handbook, EnergyPlus™ Documentation Version 9.4.0 (September 2020); US Department of Energy: https://energyplus.net/sites/all/modules/custom/nrel_custom/pdfs/pdfs_v9.4.0/EngineeringReference.pdf"). This open source software requires as input a set of historical weather data for the relevant period or periods of time. For example, if the simulation is to be run for September, data from September of the previous year or of several previous years is required. As output, the necessary information is provided on the use of HVAC units to achieve the desired conditions.

[0074] The agent 200 is configured to interact with the virtual environment module 100 based on the output of the reinforcement learning system 110. In particular, the agent 200 is configured to control devices related to the virtual environment to achieve the desired environmental conditions.

[0075] The agent 200 operates on the output of the reinforcement learning system 110, where the reinforcement learning system 110 receives a set of input values and then provides the agent 200 with a set of actions (e.g., outputs), where these actions are related to the operation of the HVAC device in the environment. The agent 200 then implements the actions. For example, a set of weather conditions can be used as input to the reinforcement learning system; the reinforcement learning system 110 then determines (using a neural network, as described below) the appropriate actions to take to obtain the desired set of conditions given those weather conditions. These actions are passed to the agent 200, which interacts with the HVAC devices in the environment to implement the actions (e.g., to turn the air conditioner on or off).

[0076] Agent 200 may be combined with reinforcement learning system 110, where the reinforcement learning software system includes an agent and is designed to interact directly with HVAC devices in the environment.

[0077] A separate agent 200 and/or reinforcement learning software system may be provided for each HVAC device in the environment. Similarly, there may be a separate agent 200 and/or a reinforcement learning software system provided for each group of HVAC units in the environment (for example, an air conditioner group and a radiator group may use two separate agents). Similarly, there may be a single agent 200 and/or a reinforcement learning system 110 that is designed to control all devices in an environment.

[0078] In the context of the present solution, a reinforcement learning method is also provided, in which the reinforcement learning software system is trained so that it is suitable for controlling the agent 200. To interact with the virtual environment module 100, the agent 200 usually receives data indicative of its state, and takes an action from a set of actions as a response to this data, this action is determined by the reinforcement learning software system. Data characterizing the state of the environment can be called observation or input. The goal of the system as a whole is to maximize the "reward" received for performing activities in the described environment, where the reward is related to the achievement/exceeding of one or more indicators (eg, provided by the simulator 400). The present disclosure also relates to a method for controlling at least one HVAC device based on a trained reinforcement learning software system and/or agent 200.

[0079] Typically, the reinforcement learning system 110 selects the actions to be performed by the agent 200 by interacting with the virtual environment module 100. That is, the reinforcement learning system 110 receives data related to the environment and environmental conditions and selects an action from a set of actions to be performed by the agent 200. Typically, an action is selected from a continuous set of actions. Alternatively, an action may be selected from a discrete set of actions (e.g., a limited number of possible actions).
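By way of a non-limiting illustration, the difference between selecting from a continuous set of actions and from a discrete set of actions can be sketched as follows. Both helper functions, the setpoint range and the action labels are assumptions made for this sketch only.

```python
def to_continuous_action(raw_output, low, high):
    """Clamp a raw network output into a continuous action range,
    e.g. a temperature setpoint between `low` and `high`."""
    return max(low, min(high, raw_output))


def to_discrete_action(scores, actions):
    """Greedily pick the action with the highest score from a finite set."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return actions[best_index]


# Hypothetical usage: a setpoint in degrees Celsius and three device modes.
setpoint = to_continuous_action(31.5, low=16.0, high=30.0)
mode = to_discrete_action([0.1, 0.7, 0.2], ["off", "cool", "heat"])
```

In the continuous case the output is a point in a range; in the discrete case it is one member of a limited list of possible actions.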

[0080] Referring to FIG. 3, each of the components of the described system is typically implemented using the computing device 1000. Each of these components may be implemented using the same computing device, or the components may be implemented using multiple computing devices.

[0081] The computing device 1000 includes a processor in the form of a central processing unit (CPU) 1002, a communication interface 1004, a memory 1006, a storage device 1008, a removable storage device 1010, and a user interface 1012, connected to each other by a bus 1014. The user interface 1012 includes a display 1016 and an input/output device, which in this embodiment comprises the keyboard 1018 and the mouse 1020.

[0082] The CPU 1002 executes instructions, including instructions stored in memory 1006, storage device 1008, and/or removable storage device 1010.

[0083] Communication interface 1004 is typically an Ethernet network adapter connecting bus 1014 to an Ethernet outlet. The Ethernet connector is connected to the network. The Ethernet jack is usually connected to the network via a wired connection, but the connection can alternatively be wireless. It should be understood that many other means of communication (eg, Bluetooth®, infrared, etc.) may be used.

[0084] Memory 1006 stores instructions and other information for use by CPU 1002. Memory 1006 is the main memory of computing device 1000. It typically includes both random access memory (RAM) and read only memory (ROM).

[0085] The storage device 1008 may be implemented as a mass storage device for the computing device 1000. In various implementations, the storage device is a built-in storage device in the form of a hard disk, flash memory, or some other similar mass storage device or an array of such devices.

[0086] Removable storage device 1010 may be implemented as an auxiliary storage device for computing device 1000. In various implementations, the removable storage device is a removable storage medium, such as an optical disk (for example, a digital versatile disk (DVD)), a portable flash drive or some other similar portable solid-state storage device (computer-readable medium), or a plurality of such devices. In other embodiments, the removable storage device is remote from the computing device 1000 and includes a network storage device or a cloud storage device.

[0087] Computing device 1000 may include one or more graphics processing units (GPUs), application specific integrated circuits (ASICs), and/or one or more field programmable gate arrays (FPGAs).

[0088] A computer program product is provided that includes instructions for performing aspects of the method(s) described below. The computer program product is stored, at various stages, in any of the memory 1006, the storage device 1008, and the removable storage device 1010. The storage of the computer program product is long term, except when the instructions included in the computer program product are executed by the CPU 1002, in which case the instructions are sometimes temporarily stored in the CPU or memory. It should also be noted that the removable storage device is removable from the computing device 1000, such that the computer program product is stored separately from the computing device from time to time.

[0089] Typically, reinforcement learning system 110 is trained using a first computing device. Once reinforcement learning system 110 has been trained on this first computing device, it can be used to control a system (e.g., an HVAC system). To this end, the trained reinforcement learning system 110 and/or agent 200 may be output and/or transferred to another computing device.

[0090] Typically, when the reinforcement learning system 110 is used to control the system, the computing device 1000 is configured to receive input either through a sensor or through the communication interface 1004. The input data may include one or more of the following: environmental conditions, weather conditions (for example, a weather forecast), desired conditions (for example, a desired temperature entered by the user), and/or environmental configuration (for example, an indication of whether each building door is open or closed).

[0091] FIG. 4 shows an exemplary neural network 10 that may be part of a reinforcement learning system 110.

[0092] The neural network 10 in FIG. 4 is a deep neural network that includes an input layer 12, one or more hidden layers 14, and an output layer 16. It will be appreciated that the example in FIG. 4 is just a simple example and in practice the neural network may contain additional layers. Also, although this exemplary neural network is a deep neural network, containing a hidden layer 14, more generally, a neural network could simply contain any layer that maps an input layer to an output layer.

[0093] A neural network is based on parameters that map inputs (eg, observations) to outputs (eg, actions) to achieve a desired outcome for a given input. These parameters usually contain weights. To determine the parameters that ensure efficient operation, the neural network is trained using training sets of input data. In the present system, the inputs to the neural network may be referred to as observations, with each observation characterizing the state of the environment (eg, current conditions).
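By way of a non-limiting illustration, the mapping from an observation to an output through an input layer, one hidden layer and an output layer can be sketched as follows. The layer sizes, the tanh activation and the parameter names `w1`, `b1`, `w2`, `b2` are assumptions made for this sketch; the disclosure does not fix a particular architecture.

```python
import math


def dense(inputs, weights, biases, activation=None):
    """Apply one fully connected layer: out[j] = act(sum_i w[j][i] * x[i] + b[j])."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def forward(observation, params):
    """Map an observation (input layer) through a hidden layer to the output layer."""
    hidden = dense(observation, params["w1"], params["b1"], math.tanh)
    return dense(hidden, params["w2"], params["b2"])


# Hypothetical parameters: 2 inputs, 2 hidden units, 1 output.
params = {
    "w1": [[0.0, 0.0], [0.0, 0.0]],
    "b1": [0.0, 0.0],
    "w2": [[0.0, 0.0]],
    "b2": [0.5],
}
action = forward([25.0, 0.6], params)
```

Training consists of adjusting the weights and biases in `params` so that the output for each observation moves toward the desired action.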

[0094] Typically, neural network training consists of several stages, in which the parameters of the neural network are updated after each stage of training based on the performance of the neural network at that stage. For example, if a parameter change is defined to improve the performance of the neural network during the first training step, a similar change can be implemented before the second training step. Typically, training a neural network involves a series of steps to ensure that the parameters provide the desired output for a range of input datasets.

[0095] According to the present disclosure, the neural network is trained based on the simulator 400, where the metric provided by the simulator 400 allows the neural network to be trained faster than using traditional training methods.

[0096] Referring to FIG. 5, the reinforcement learning system 110 typically comprises an actor-critic (Actor-Critic) system 20.

[0097] The actor-critic system 20 comprises two separate neural networks, one of which is an actor neural network 22 (Actor), and the other a critic neural network 24 (Critic).

[0098] The actor neural network (Actor) 22 is designed to receive input from the virtual environment module 100 and to map that input to actions that the agent 200 can perform. For example, the actor neural network may receive a state as input from the environment, and then determine that the agent 200 should turn on the air conditioning unit (note that the actor neural network usually maps input tuples to output tuples without knowing the meaning of these tuples, so the actor neural network does not know that this output refers to switching on the air conditioner).

[0099] The critic neural network (Critic) 24 is designed to receive a state, or input, from the virtual environment module 100 and an action from the actor neural network 22 (Actor), and to determine a value related to this action and state. Therefore, the critic neural network evaluates the result of the action proposed by the actor neural network.

[0100] In a practical (basic) example of the actor-critic system 20, the simulator 400 may indicate that a temperature of 23°C is desired. The virtual environment module 100 may then provide a state value indicating that the temperature is 25°C at a first time. Given this input state, the actor neural network 22 may return an action to turn on the radiator. The critic neural network 24 determines the result of this action: for example, this action is likely to lead to an increase in temperature (thus, the temperature at a second time may be 27°C). The critic neural network 24 then receives the environment state values for the first and second times and determines that the temperature has risen further above the desired value and that the proposed action therefore had a negative effect. This assessment is transmitted back to the actor neural network 22 (Actor), and in response, the parameters of the actor neural network 22 (Actor) are changed so that a similar action is not proposed in the same situation in the future. Therefore, the parameters of the actor neural network 22 (Actor) are changed based on the feedback from the critic neural network 24.

[0101] The information provided to the actor neural network 22 (Actor) by the critic neural network 24 (Critic) contains a temporal difference (TD) error. This TD error takes the passage of time into account, so that it can be recognized, for example, that turning on a radiator may only have an effect after a certain waiting period.
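By way of a non-limiting illustration, a temporal difference error can be sketched using the standard one-step form, in which the reward plus the discounted value of the next state is compared with the value of the current state. The disclosure does not specify the exact formula; the one-step form, the discount factor `gamma` and the `done` flag are assumptions made for this sketch.

```python
def td_error(reward, value_state, value_next_state, gamma=0.99, done=False):
    """One-step TD error: (reward + gamma * V(next state)) - V(current state).

    When the episode has ended (`done`), there is no next state to bootstrap from.
    """
    bootstrap = 0.0 if done else gamma * value_next_state
    return reward + bootstrap - value_state


delta = td_error(reward=1.0, value_state=2.0, value_next_state=3.0, gamma=0.5)
```

A positive error indicates that the outcome was better than the critic expected; a negative error indicates that it was worse. Because the next-state value is included, delayed effects of an action can be credited to that action.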

[0102] The feedback from the critic neural network 24 may include an indication of an error between the condition reached and the desired condition, with the parameters of the actor neural network 22 (Actor) being updated to minimize this error.

[0103] The critic neural network 24 provides a subjective assessment of the benefits of the current state of the actor neural network 22 (Actor). To improve this assessment, especially in the early stages of training, the simulator 400 can be used to determine the appropriate parameters for the critic neural network 24. In particular, the reward used to train one of the neural networks can be based on the difference between the target metric and the metric associated with the current parameters of this neural network. In practice, the action proposed by the actor neural network 22 (Actor) is usually associated with a metric (e.g., electricity usage). The modifications made to the parameters of the actor neural network 22 (and/or the critic neural network 24) during the training phase may depend on the difference between this metric and the target metric provided by the simulator 400.

[0104] In this regard, the parameters of the critic neural network 24, and thus the value it determines, may depend on the condition and/or metric received from the simulator 400. In a conventional system, the critic neural network 24 may identify that the set of actions proposed by the actor neural network 22 (Actor) corresponds to an electricity cost metric of 100 kWh; however, without the context of the particular situation, the critic neural network 24 cannot determine whether this value is good or bad, and usually has to learn whether a certain outcome is good or bad through numerous learning steps. Therefore, in the early stages of training, the critic neural network 24 can only provide limited feedback to the actor neural network 22 (Actor), and the parameters of the actor neural network 22 (Actor) can only be changed slowly. By providing a metric from simulator 400, the critic neural network can learn much faster. For example, if the simulator 400 achieved an electricity cost metric of 20 kWh for the same situation, the critic neural network 24 can quickly determine that the current parameters of the actor neural network 22 (Actor), which resulted in the electricity cost metric of 100 kWh, differ significantly from the baseline and are not optimal.
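By way of a non-limiting illustration, a reward based on the difference between the target metric provided by the simulator and the metric achieved by the neural network can be sketched as follows. The normalization by the baseline value and the function name are assumptions made for this sketch; the disclosure only states that the reward may depend on the difference.

```python
def baseline_relative_reward(achieved_kwh, baseline_kwh):
    """Reward relative to the simulator baseline for a 'lower is better' metric.

    Positive when the agent uses less energy than the baseline,
    negative when it uses more.
    """
    if baseline_kwh <= 0:
        raise ValueError("baseline_kwh must be positive")
    return (baseline_kwh - achieved_kwh) / baseline_kwh


# Figures from the example above: 100 kWh achieved versus a 20 kWh baseline.
early_reward = baseline_relative_reward(100.0, 20.0)
# Hypothetical later stage: the agent outperforms the baseline.
later_reward = baseline_relative_reward(15.0, 20.0)
```

With the baseline as a reference point, a strongly negative reward is available from the first training steps, rather than only after the critic has learned the typical range of the metric.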

[0105] Thus, by training the critic neural network 24 and/or the actor neural network 22 (Actor) depending on the condition and/or metric provided by the simulator 400, the training time of the reinforcement learning system 110 can be reduced.

[0106] The simulator 400 can be used to quickly train the actor neural network 22 (Actor) and the critic neural network 24 to provide an agent 200 that achieves performance at least similar to the simulator 400. Then the neural network actor 22 (Actor) and neural network-critic 24 can continue training so that the reinforcement learning system begins to outperform simulator 400. Using simulator 400 reduces the total training time required for the reinforcement learning software system.

[0107] Actor-critic system training may include:

a. receiving mini-batches of experience tuples from the virtual environment. Each experience tuple contains a training observation characterizing the training state of the virtual environment, a training action from the space of possible actions for agent 200, a training reward associated with agent 200 performing the training action, and a next training observation characterizing the next training state of the virtual environment;

b. updating the current values of the parameters of the actor neural network 22 using the received mini-batches of experience tuples. For each experience tuple in the mini-batch, the update typically contains the following step:

c. processing the training observation and training action in the experience tuple using the critic neural network 24 to determine the output of the neural network for the experience tuple according to the current values of the parameters of the critic neural network 24, and updating the current values of the parameters of the actor neural network using the critic neural network.

At the same time, a new experience tuple is created. This stage usually consists of the following steps:

a. obtaining a new training observation;

b. processing this new training observation using the actor neural network 22 (Actor) to select a new training action to be performed by the agent 200, in accordance with the current values of the parameters of the actor neural network 22;

c. receiving a new training reward in response to agent 200 performing this new training action;

d. obtaining a new next training observation;

e. creating a new experience tuple that includes, as described above, the new training observation, the new training action, the new training reward, and the new next training observation.
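By way of a non-limiting illustration, an experience tuple and the drawing of mini-batches can be sketched as follows. The `Experience` and `ExperienceBuffer` names, the fixed capacity and the uniform random sampling are assumptions made for this sketch; the disclosure only requires that mini-batches of such tuples be received from the virtual environment.

```python
import random
from collections import deque
from typing import NamedTuple, Tuple


class Experience(NamedTuple):
    observation: Tuple[float, ...]       # training observation (state)
    action: float                        # training action taken by the agent
    reward: float                        # training reward for that action
    next_observation: Tuple[float, ...]  # next training observation


class ExperienceBuffer:
    """Hypothetical bounded store of experience tuples; oldest entries are dropped."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def add(self, experience):
        self._items.append(experience)

    def sample(self, batch_size, rng=random):
        """Return a mini-batch of up to `batch_size` distinct stored tuples."""
        items = list(self._items)
        return rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self._items)


buffer = ExperienceBuffer(capacity=4)
for step in range(6):
    buffer.add(Experience((25.0 - step,), 1.0, -0.1 * step, (24.0 - step,)))
batch = buffer.sample(3, rng=random.Random(0))
```

Each newly created tuple is added to the store, and the update step then operates on a mini-batch drawn from it.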

[0108] An example of an existing system that uses the actor-critic scheme is the Asynchronous Advantage Actor-Critic (A3C) (https://www.machinelearningmastery.ru/the-idea-behind-actor-critics-and-how-a2c-and-a3c-improve-them-6dd7dfd0acb8/). The methods disclosed herein may be implemented using such a system. Likewise, the methods disclosed herein can equally be implemented using a Mixture-Density Recurrent Network (MDN-RNN).

[0109] In some embodiments, reinforcement learning system 110 does not include an actor-critic system. Reinforcement learning system 110 will still include at least one neural network. In such systems, simulator 400 typically provides feedback on the reward model that is used to evaluate the performance of the neural network.

[0110] FIG. 6 shows an embodiment of reinforcement learning system 110 in more detail.

[0111] In this embodiment, the reinforcement learning system 110 is designed to select actions (outputs) using the global neural network 300. The global neural network is a neural network that is configured to accept a set of inputs and process them to associate those inputs with an action (for example, if the temperature rises, turn on the air conditioner). Typically, this includes the global neural network selecting a point in the continuous action space that determines the action to be performed by the agent 200 in response to the input.

[0112] This global neural network 300 is typically trained through reinforcement learning. Alternatively, a global neural network can be trained via supervised or unsupervised learning (so that the disclosure of reinforcement learning system 110 is more generally the disclosure of a learning system that contains at least one neural network).

[0113] The global neural network 300 typically includes a global actor neural network 301, which provides a function for mapping inputs to outputs (e.g., actions), and a global critic neural network 302, which is designed to accept actions and inputs (states) as input and to process them to produce neural network outputs. During training, the reinforcement learning system 110 manages the parameters of the global critic neural network and the global actor neural network.

[0114] The reinforcement learning system disclosed here has several differences from current architectural solutions.

[0115] For training global actor neural network 301 and global critic neural network 302, the reinforcement learning system comprises at least one child neural network 310-1, 310-2, 310-N; each child neural network contains a child actor neural network 311-1, 311-2, 311-N and a child critic neural network 312-1, 312-2, 312-N. These child neural networks interact with the virtual environment module 100 at the same time (as shown in FIG. 6). This is achieved by providing respective copies of the virtual environment 101-1, 101-2, 101-N, with a separate copy of the virtual environment created and provided for each of the child neural networks.

[0116] It should be understood that any number of child neural networks 310 may be used, where the number used may be implementation dependent.

[0117] Each copy of the virtual environment has different properties that can be set using different initialization conditions. Typically, these properties refer to different configurations of the target real physical environment (eg, different building configurations). For example, each copy of the virtual environment may refer to a different population density or a different configuration of objects in the real physical environment (eg, different sets of doors that open or close). Copies of the virtual environment may be initialized based on a number of parameters, each of which may be randomly selected or selected by the user. These parameters typically include one or more of the following: initial environmental conditions (eg initial temperature or humidity); initial state outside the environment; initial employment and/or occupancy; and characteristics of the devices (for example, the type of coolant used).
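By way of a non-limiting illustration, the initialization of several copies of the virtual environment with randomly selected properties can be sketched as follows. The `EnvConfig` fields, the value ranges and the number of doors are assumptions made for this sketch; a user could equally supply the values directly.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EnvConfig:
    initial_temp_c: float          # initial indoor temperature
    initial_humidity: float        # initial relative humidity, 0..1
    occupancy: int                 # number of people present
    doors_open: Tuple[bool, ...]   # open/closed state of each door


def make_env_configs(n_copies: int, seed: int = 0, n_doors: int = 3) -> List[EnvConfig]:
    """Draw one randomly initialized configuration per environment copy."""
    rng = random.Random(seed)  # seeded so that the copies are reproducible
    configs = []
    for _ in range(n_copies):
        configs.append(EnvConfig(
            initial_temp_c=rng.uniform(15.0, 30.0),
            initial_humidity=rng.uniform(0.2, 0.8),
            occupancy=rng.randint(0, 200),
            doors_open=tuple(rng.random() < 0.5 for _ in range(n_doors)),
        ))
    return configs


configs = make_env_configs(4, seed=42)
```

Each configuration would then be used to initialize one copy of the virtual environment, so that each child neural network trains against different conditions.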

[0118] Therefore, each child neural network 310 is trained based on a different environment with different properties. Thus, child neural networks may be structurally identical and/or may be initialized with the same parameters, and these parameters will then diverge due to the interaction of the child neural networks with different versions of the environment. Alternatively, child neural networks can be initialized with different parameters. A plurality of similar copies of the virtual environment can also be provided, where these similar copies are associated with differently initialized child neural networks.

[0119] Typically, simulator 400 is configured to provide a metric and/or set of target conditions for each copy of the virtual environment 101. For example, modeling a baseline for a crowded environment may result in higher energy consumption (or another measure of energy usage) than for a sparsely populated environment.

[0120] For each input tuple at each stage, the reinforcement learning system 110 is configured to use the child critic neural networks 312 to determine updates to the current parameters of the child actor neural networks 311 and to the weights of the child critic neural networks. After each step, the parameters of child actor neural networks 311 and child critic neural networks 312 are changed according to the determined updates.

[0121] Child actor neural networks 311 and child critic neural networks 312 send, in accordance with process 35 depicted in FIGS. 6 and 7, the results of their work (for example, gradients that provide an indication of the performance of these child neural networks 310) to the global neural network 300. The global neural network then accumulates these training results and uses them to change the parameters of the global actor neural network 301 and the global critic neural network 302. In particular, global neural network 300 is able to identify parameters and/or parameter modifications that result in improvements for various child neural networks 310, so that the parameters of global neural network 300 can be updated accordingly.

[0122] The structure of each neural network is typically as follows: the global neural network 300 is trained by child neural networks 310, and the parameters of the global neural network are defined based on gradients and/or parameters received from child neural networks.

[0123] Typically, the parameters of the global neural network 300 are determined depending on the average value of the gradients provided by the child neural networks 310. These gradients can be applied to the existing parameters of the global neural network (and, more specifically, to the existing parameters of the global neural network-actor 301 and the global neural network-critic 302) to get updated parameters.
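By way of a non-limiting illustration, updating the parameters of the global neural network from the average of the gradients supplied by the child neural networks can be sketched as follows. The flat list representation of parameters, the plain gradient descent step and the learning rate `lr` are assumptions made for this sketch.

```python
def average_gradients(child_gradients):
    """Element-wise mean of the gradient vectors sent by the child networks."""
    n_children = len(child_gradients)
    n_params = len(child_gradients[0])
    return [
        sum(grads[i] for grads in child_gradients) / n_children
        for i in range(n_params)
    ]


def apply_gradients(params, gradients, lr=0.01):
    """Apply one gradient descent step to the existing global parameters."""
    return [p - lr * g for p, g in zip(params, gradients)]


# Hypothetical gradients from two child networks for a two-parameter model.
mean_grad = average_gradients([[1.0, 2.0], [3.0, 4.0]])
global_params = apply_gradients([1.0, 1.0], mean_grad, lr=0.5)
```

Averaging means that a change which helps only one environment copy is damped, while a change that helps most copies is retained.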

[0124] In turn, the child neural networks 310 are configured to periodically receive parameter updates from the global neural network 300 when a condition is met (eg, when a certain number of training steps have been completed) and/or when a given set of actions has been performed. Typically, a global neural network is designed to periodically send copies of its parameters to child neural networks, as shown in FIG. 7, where these parameters can replace the existing parameters of each child neural network.

[0125] Child neural networks 310 are trained based on different copies of the virtual environment 101 having different properties, so they are designed for different situations. For example, the first child neural network 310-1 may provide optimal performance when the environment is crowded (crowded room and/or rooms), the second child neural network 310-2 may provide optimal performance when the environment is sparsely populated (sparsely occupied premises and/or premises).

[0126] By periodically passing their gradients and/or parameters to the global neural network 300, child neural networks 310 can train the global neural network to provide good performance in a number of types of environments (e.g., environments with a number of different properties). Periodically, the parameters of the global neural network are transmitted to the child neural networks; therefore, the child neural networks indirectly transmit their parameters to one another. For example, the first child neural network 310-1 can be trained based on the first copy of the virtual environment 101-1, which is characteristic of a crowded environment, and the second child neural network 310-2 can be trained based on the second copy of the virtual environment 101-2, which is characteristic of a sparsely populated environment. The first child neural network can be trained indirectly for use in a sparsely populated environment by receiving parameters from a global neural network that has previously received gradients and/or parameters from the second child neural network. Eventually, the neural networks converge to provide a series of similar and/or identical child neural networks (and a global neural network that is the same) that provide optimal performance for a wide range of environments (sparsely populated environment, crowded environment, etc.).

[0127] Once said convergence has occurred, reinforcement learning system 110 can be used to provide suitable output for a given set of inputs. This convergence (or near convergence) can be indicated, for example, by displaying a notification to the user, or by providing output based on input to reinforcement learning system 110.

[0128] The frequency of data transmission (e.g., transmission of gradients/parameters from child neural networks 310 to global neural network 300 and transmission of parameters from the global neural network to child neural networks) can be based on a number of steps, user input, or rate of convergence.

[0129] If the period between transmissions is too short, that is, if the parameters are passed too often, the child neural networks cannot diverge enough to ensure optimal performance in their copy of the virtual environment. If the period is too long, the time required to train the reinforcement learning system can become excessive. Therefore, in various embodiments, the global neural network is designed to provide parameters to child neural networks no more than once every 20 steps, no more than once every 50 steps, and/or no more than once every 100 steps. Similarly, in various embodiments, the global neural network is designed to provide parameters to child neural networks at least once every 200 steps, at least once every 100 steps, and/or at least once every 50 steps.
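By way of a non-limiting illustration, a step-based condition for transmitting parameters from the global neural network to the child neural networks can be sketched as follows. The function names and the fixed interval are assumptions made for this sketch; as noted above, the period could instead depend on user input or on the rate of convergence.

```python
def should_sync(step, interval):
    """True on every `interval`-th training step (steps are counted from 1)."""
    return step > 0 and step % interval == 0


def sync_steps(total_steps, interval):
    """List the steps at which child networks would receive global parameters."""
    return [s for s in range(1, total_steps + 1) if should_sync(s, interval)]


# Hypothetical schedule: 200 training steps, parameters sent every 50 steps.
schedule = sync_steps(200, 50)
```

A larger interval gives each child neural network more steps in which to specialize to its own copy of the virtual environment before its parameters are replaced.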

[0130] Typically, one or more of the following functions are implemented by reinforcement learning system 110 as shown in FIG. 8:

1) At each training step, for each copy of the virtual environment 101 and agent 200, each child actor neural network 311 receives a set of tuples from the corresponding copy of the virtual environment 101, in accordance with process 31. Each tuple contains data characterizing the state of the environment, an action from a given action space performed by agent 200 in response to the data, the reward for the action taken by agent 200, and the next set of input data, which characterizes the next training state of the environment.

2) Each child actor neural network 311 then updates, in accordance with process 34, the current values of its parameters using the data obtained from the following sequence of actions:

a) Each input and action in the received tuple is processed, in accordance with process 32, using the corresponding child critic neural network 312 to determine the output of the corresponding child actor neural network 311 for the tuple according to the current parameters of the child actor neural network 311 and the child critic neural network 312. Process 32 determines a predicted outcome or expected reward (e.g., expected performance) of the child neural network 310 for the received tuple from the training reward and the next set of training inputs in the same tuple. The current parameter values of child actor neural network 311 and child critic neural network 312 are then updated, according to process 33, using an estimate of the benefit of the predicted outputs of child neural network 310 and the reward from the environment, and based on the condition(s) and the metric(s) obtained from the simulator 400. For example, if the simulator 400 indicates that the child neural network is performing well below the maximum known possible performance, the parameters can be substantially changed; if the simulator 400 indicates that the child neural network is operating close to the maximum known possible performance, the parameters may only be changed by a small amount. The reward is usually based on the difference between the output of the child actor neural network and the metric provided by the simulator;

b) The current parameters of each child actor neural network 311 are updated, in accordance with process 33, using the corresponding child critic neural network 312 and, for example, an entropy loss function.

3) Periodically, after a predetermined condition is met (for example, after a certain number of actions have been processed by the child neural network 310) or, where applicable, a given training step has been completed, the current values of the parameters of the global actor neural network 301 and global critic neural network 302 are updated, in accordance with process 35, based on gradients and/or parameters sent by child neural networks. Each child neural network may send these gradients/parameters at the same time, or the timing of the gradients/parameters may differ for different child neural networks.

4) Periodically, for example, in accordance with a predetermined condition, the parameters of the child actor neural network 311 and the child critic neural network 312 are updated, in accordance with the process 36, using the parameters of the global neural network 300.
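
Steps 2) to 4) can be pictured as a loop in which each child accumulates gradients locally, sends them to the global network once a condition is met, and then copies the global parameters back. The sketch below is a minimal illustration under stated assumptions, not the disclosed implementation: parameters are plain lists of floats, the gradient comes from a stand-in quadratic loss per environment copy, and the names `GlobalNet`, `ChildNet`, `train`, `push_every` and `lr` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GlobalNet:
    """Stand-in for global neural network 300: holds the shared parameters."""
    params: List[float]
    lr: float = 0.1  # hypothetical learning rate

    def apply_gradients(self, grads: List[float]) -> None:
        # Process 35: update the global parameters from a child's gradients.
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]


@dataclass
class ChildNet:
    """Stand-in for child neural network 310 bound to one environment copy."""
    target: List[float]  # assumed optimum for this environment copy
    params: List[float] = field(default_factory=list)
    grad_sum: List[float] = field(default_factory=list)
    steps: int = 0

    def pull(self, global_net: GlobalNet) -> None:
        # Process 36: copy the global parameters into the child.
        self.params = list(global_net.params)
        self.grad_sum = [0.0] * len(self.params)
        self.steps = 0

    def local_step(self) -> None:
        # Gradient of the toy loss 0.5 * (p - target)^2 for each parameter.
        grads = [p - t for p, t in zip(self.params, self.target)]
        self.grad_sum = [s + g for s, g in zip(self.grad_sum, grads)]
        self.steps += 1


def train(global_net, children, rounds=200, push_every=5):
    for child in children:
        child.pull(global_net)
    for _ in range(rounds):
        for child in children:
            child.local_step()
            if child.steps >= push_every:  # the predetermined condition
                mean = [s / child.steps for s in child.grad_sum]
                global_net.apply_gradients(mean)  # send gradients upward
                child.pull(global_net)            # resynchronize the child
    return global_net.params
```

With two children whose toy optima are 0.0 and 2.0, the global parameter settles near 1.0, a compromise between the two environment copies. Real gradients would come from backpropagation through the actor and critic networks.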

[0131] The update 36 of the current parameters of each child critic neural network 312 is performed using the error between the reward given by the environment for performing the selected action and the expected reward that the child critic neural network 312 estimates for the output of the child actor neural network 311 (for example, the difference between the desired and achieved value). Optionally, the error between the output of the child actor neural network 311 for the current observation and the results of the simulator 400 may be used. Typically, the system determines an update to the current parameters that reduces the error using conventional machine learning and optimization techniques, such as performing an iteration of gradient descent with backpropagation.
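
The critic update in [0131] amounts to a regression step: the critic's estimate of the expected reward is compared with the reward returned by the environment, and the parameters move against the gradient of the squared error. The following is a minimal sketch, assuming a linear critic and a one-step target without bootstrapping. The names `critic_value` and `critic_update`, the feature vector and the learning rate `lr` are illustrative and do not appear in the disclosure.

```python
from typing import List


def critic_value(weights: List[float], features: List[float]) -> float:
    """Linear critic: estimated reward for an (observation, action) feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def critic_update(weights, features, reward, lr=0.05):
    """One gradient-descent step on the squared error 0.5 * (estimate - reward)^2."""
    error = critic_value(weights, features) - reward
    # For a linear critic, d(0.5 * error^2)/dw_i = error * x_i.
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    return new_weights, error
```

Repeating the step on the same sample shrinks the error geometrically, which is the behaviour the paragraph describes as reducing the error by gradient descent.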

[0132] To update the parameters of the child neural network actor 311, the reinforcement learning system 110, having processed the input data using the child neural network actor 311, selects the next action and updates these parameters using one or more of the following:

• the gradient provided by the child critic neural network 312 (gradient 1) with respect to the next action taken in processing the associated inputs and outputs in accordance with the current parameters of the child critic neural network (e.g., the improvement and/or reduction in performance that the child critic neural network determines will take place if the action is proposed based on the current parameters of the child actor neural network);

• the gradient provided by the child actor network 311 (gradient 2) with respect to the parameters of the child actor network 311 taken during the training observation and in accordance with the current values of the parameters of the child actor neural network 311.
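
Read together, the two gradients combine by the chain rule: gradient 1 says how the critic's estimate changes with the action, gradient 2 says how the action changes with the actor's parameters, and their product gives the direction in which to move the actor. The sketch below assumes a deterministic scalar actor `a = theta * s` and a hand-written critic `Q(s, a) = -(a - 2s)^2`, so that both gradients are available in closed form; a real system would obtain them by backpropagation. All function names are hypothetical.

```python
def actor(theta: float, state: float) -> float:
    """Toy deterministic actor: the action is linear in the state."""
    return theta * state


def critic_grad_wrt_action(state: float, action: float) -> float:
    """Gradient 1: dQ/da for the assumed critic Q(s, a) = -(a - 2s)^2."""
    return -2.0 * (action - 2.0 * state)


def actor_grad_wrt_params(state: float) -> float:
    """Gradient 2: da/dtheta for the toy actor a = theta * s."""
    return state


def actor_update(theta: float, state: float, lr: float = 0.1) -> float:
    action = actor(theta, state)
    # Chain rule: dQ/dtheta = dQ/da * da/dtheta; ascend the critic's estimate.
    grad = critic_grad_wrt_action(state, action) * actor_grad_wrt_params(state)
    return theta + lr * grad
```

For this toy critic the best action is `2 * state`, so repeated updates drive `theta` toward 2.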

[0133] In fact, if changing the parameters of the child neural network-actor 311 had a positive effect on the child neural network (for example, if this change leads to the achievement of the same conditions at lower energy costs), then a similar change in parameters can be implemented again. In addition, if a previous change had a large positive effect, then a large similar change can be implemented. On the other hand, if changing the parameters of the child neural network actor 311 had a negative effect, then an alternative change can be implemented.

[0134] The reinforcement learning software system may compute gradient 1 and gradient 2 by backpropagating the respective gradients through the respective neural networks.

[0135] In general, the reinforcement learning software system performs this process for each input tuple after each update of the parameters of the child critic 312 or the child actor 311 (eg, after each training step). Once the updates for each tuple have been calculated, the reinforcement learning software system updates the current parameters of each child actor neural network 311 and each child critic neural network 312 using the tuple updates. Thus, reinforcement learning system 110 iteratively improves the parameters of each of the neural networks.

[0136] Periodically, using the gradients of child actor neural networks 311 and child critic neural networks 312, reinforcement learning system 110 updates, in accordance with process 35, the parameters of global actor neural network 301 and global critic neural network 302.

[0137] Synchronization of the parameters with the global neural network 300 is triggered by a condition applicable to the child neural networks 310 (for example, when a certain number of actions have been performed by the child neural networks), and is usually performed by copying the parameters of the global neural network 300 after the stage of updating the parameters of the global neural network.

[0138] After processing a certain number of mini-batches of data, the global neural network 300 resets the gradients, and the process is repeated until the agent 200 achieves the required decision quality.

[0139] Thus, by iteratively updating the current values of the parameters, the reinforcement learning software system trains the global neural network 300 and child neural networks 310 to generate neural network outputs that represent the time-discounted cumulative future rewards that can be obtained if agent 200 performs a given action in response to a given set of inputs.
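
The cumulative future reward described in [0139] is commonly computed as a discounted sum. As a worked illustration, assuming a discount factor `gamma` (a hyperparameter the text does not specify) and a hypothetical helper `discounted_returns`, the return at each step of an episode can be computed backwards from the last reward:

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For rewards `[1.0, 1.0, 1.0]` and `gamma = 0.5` the returns are `[1.75, 1.5, 1.0]`: earlier steps are credited with more of the future, but each later reward counts for half as much as the one before it.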

[0140] Considering this from a practical point of view, the first child neural network 310-1 can interact with a copy of the virtual environment 101-1 that represents a crowded environment (a room or rooms with high occupancy density). After a series of training steps, the parameters of the first child neural network will reach values that ensure good performance of the agent 200 for this type of environment. Similarly, the second child neural network 310-2 may interact with a copy of the virtual environment 101-2 that is sparsely populated (a room or rooms with low occupancy density) and may arrive at parameters that make the agent 200 work well for this type of environment. The global neural network 300 receives gradients from each of these child neural networks and uses these gradients to form a neural network that performs well in each environment. The parameters of this global neural network are then periodically copied to the child neural networks. These parameters may not perform optimally for one or more types of environments, so the child neural networks are retrained on the appropriate copies of the environments (and then the parameters of the child neural networks are again provided to the global neural network). Over time, all neural networks converge to a neural network that provides optimal performance for each type of environment.

[0141] It is also possible to apply these techniques using multiple agents 200, where each agent may have a different action space and the agents interact sequentially or simultaneously with copies of virtual environments 101. For example, the first agent 200-1 may be configured to manage air conditioners, while the second agent 200-2 could be configured to control radiators (and each of these agents would have different possible actions and different action spaces). For each of these agents, a separate global actor neural network, global critic neural network, child actor neural networks and child critic neural networks can be created. This allows neural networks to be trained quickly for different action spaces.

[0142] An illustration of the disclosed interaction scheme for a system containing multiple agents is shown in FIG. 9. Referring to this illustration, each of the agents 200-1, 200-2, 200-K interacts with a respective reinforcement learning system 110-1, 110-2, 110-K. Each of these reinforcement learning systems contains many child neural networks. Each of these reinforcement learning systems is designed to interact with a common set of copies of the virtual environment 101-1, 101-2, 101-N, where for each of the copies of the virtual environment, one child neural network from each reinforcement learning system interacts with this copy.

[0143] It should be understood that reinforcement learning systems can also be trained completely independently with completely separate copies of virtual environments.

[0144] In the illustration in FIG. 9, backpropagation and parameter updates for these child neural networks and global neural networks can occur independently of each other. For example, each reinforcement learning system 110-1, 110-2, 110-K can be trained separately, and the systems need share neither parameters nor gradients.

[0145] In some embodiments, copies of virtual environments 101-1, 101-2, 101-N are configured to accept actions from all agents 200 at the same time (in one tuple). In particular, all agents perform actions and add their actions to the combined tuple, which is then passed to the environment to receive an updated state. In practice, this may mean that for each copy of the virtual environment, the actions related to the first agent 200-1 are performed first (for example, turning air conditioning units on or off). Actions related to the second agent 200-2 are then performed (e.g., turning radiators on or off).

[0146] In cases where interaction with a copy of the virtual environment 101 does not occur simultaneously, the order of interaction with the environment 101 is first established, after which the agents 200 perform actions in turn in the given order. By performing actions in this way, each agent, in turn, affects the environment 101, so that each agent receives the state of the environment 101 in the training tuple. This state will have been influenced by the agents earlier in the order.
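
The two interaction modes in [0145] and [0146] can be contrasted in a few lines. In the sketch below, which is a toy under stated assumptions rather than the disclosed environment, the state is a single room temperature, each agent is a function from that temperature to a temperature change, and `step_simultaneous`, `step_sequential`, `air_conditioner` and `radiator` are hypothetical names.

```python
from typing import Callable, List

Agent = Callable[[float], float]  # maps room temperature to a temperature change


def step_simultaneous(temp: float, agents: List[Agent]) -> float:
    """All agents observe the same state; their actions form one combined tuple."""
    combined = tuple(agent(temp) for agent in agents)
    return temp + sum(combined)


def step_sequential(temp: float, agents: List[Agent]) -> float:
    """Agents act in a fixed order; each one sees the state left by earlier agents."""
    for agent in agents:
        temp = temp + agent(temp)
    return temp


def air_conditioner(temp: float) -> float:
    return -1.0 if temp > 22.0 else 0.0  # cool when above 22 degrees C


def radiator(temp: float) -> float:
    return 1.0 if temp < 22.0 else 0.0  # heat when below 22 degrees C
```

Starting from 22.5 degrees, the simultaneous step yields 21.5 (only the air conditioner reacts), while the sequential step yields 22.5 because the radiator responds to the state the air conditioner has just produced. This is the sense in which the state seen by a later agent is influenced by earlier ones.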

[0147] Agents 200 having an identical set of available actions are typically combined into a single agent having one set of global actor neural networks, global critic neural networks, child actor neural networks, and child critic neural networks. This avoids redundant training (which wastes available computing power). For example, there may be multiple agents, each controlling similar HVAC units. These agents may have the same action space, so they can be combined and trained at the same time. The action space may be determined based on the set 102 of configuration data and input device descriptions.
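
Combining agents with identical action sets is, in essence, a grouping operation over the action spaces derived from the configuration data. A minimal sketch, assuming each agent is described by a name and a list of action labels (both hypothetical, as is the helper `group_by_action_space`):

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List


def group_by_action_space(agents: Dict[str, List[str]]) -> Dict[FrozenSet[str], List[str]]:
    """Group agent names whose sets of available actions are identical."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for name, actions in agents.items():
        # frozenset ignores ordering, so ["on", "off"] matches ["off", "on"].
        groups[frozenset(actions)].append(name)
    return dict(groups)
```

Each resulting group would then share one set of global and child networks instead of training a separate set per agent.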

[0148] Typically, at each training step, child actor neural networks 311 and child critic neural networks 312 train a policy, calculate the effect of changing the parameters of these neural networks, and update the gradients in the global neural network 300 using not only the virtual environment and information about taken or potential actions, but also the results of the baseline simulator 400, in addition to the entropy estimation function, to improve the quality of training. The key performance indicators or desired conditions provided by the simulator 400 are also used by the child critic neural network to evaluate the quality of the selected policy. For example, child critic neural networks can determine whether desired conditions have been met and can then evaluate a performance metric to determine whether the neural network is performing well.

[0149] Typically, the simulator 400 provides a separate metric for each child network and/or for each copy of the virtual environment.

[0150] The output of simulator 400 typically contains a number of metrics along with rewards/priorities for those metrics. The rewards accumulated by the simulator 400 are then compared with the rewards accumulated by each agent 200.

[0151] The performance of each agent 200 may then be compared to the simulator 400. For example, the metric provided by the simulator 400 may relate to the performance of the simulator 400 (e.g., how close the simulator 400 came to achieving a certain temperature and/or humidity, and how much power was required to achieve these conditions). This metric of the simulator 400 can then be compared with each agent to determine whether any of the agents outperformed the reference simulation.
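
The comparison in [0150] and [0151], together with the earlier remark that parameters may be changed more when a network is far below the simulator's best-known performance, can be sketched as follows. The linear mapping from performance gap to update size, the clamping bounds and both function names are assumptions made for illustration.

```python
def outperforms_baseline(agent_reward: float, simulator_reward: float) -> bool:
    """True when the agent accumulated at least as much reward as the simulator."""
    return agent_reward >= simulator_reward


def update_scale(agent_reward: float, simulator_reward: float,
                 min_scale: float = 0.1, max_scale: float = 1.0) -> float:
    """Scale factor for a parameter update, larger when the agent lags the baseline."""
    if simulator_reward <= 0.0:
        return max_scale  # no usable baseline: fall back to the full step
    # Relative shortfall of the agent against the simulator, never negative.
    gap = max(0.0, (simulator_reward - agent_reward) / simulator_reward)
    return min(max_scale, max(min_scale, gap))
```

An agent at half the simulator's reward would get a scale of 0.5, while one within a few percent of the baseline would be held at the minimum scale.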

[0152] The process of determining an update for the current parameter values of the global neural network 300 or child neural networks 310 may be performed by a system of one or more computers located in one or more locations.

[0153] In addition, any neural network may include one or more layers of batch normalization.

[0154] By iterating the above processes many times using many different subsets of experience tuples, reinforcement learning system 110 is adapted to train global neural network 300 to have optimal parameters for controlling HVAC devices in the environment to achieve a desired set of conditions (which desired set of conditions can be based on user choice or neural network output). The actions selected by the global neural network given a particular set of inputs are provided to the agent 200, which is adapted to interact with the environment (eg, HVAC devices in the environment) in accordance with the provided actions.

[0155] Unlike existing methods, additional sources of reinforcement modeling are used to update all parameters in the neural networks. Thus, compared to asynchronous advantage actor-critic methods, the main indicators on which learning is focused include not only randomly selected or learned values of the value function but also modeled key indicators or baselines, which affect the time needed to converge to optimal policies.

[0156] An example of the use of the disclosed methods is to maintain the desired conditions inside a building. In this use, agent 200 can be adapted to modify the operation of radiators, air conditioners, dehumidifiers, and so on, to maintain the desired range of temperatures, humidity, etc.

[0157] As shown in FIG. 10, once reinforcement learning system 110 has been trained, the reinforcement learning system and/or one of the reinforcement learning system's component neural networks can be used to control devices in an environment such as room 10. FIG. 10 describes an embodiment in which a reinforcement learning system is used in conjunction with an agent 200 to control an HVAC device or HVAC system. An HVAC system is a system containing a plurality of HVAC devices, which may include a central control system or a controller (eg, a computer device for controlling a plurality of computer devices).

[0158] In the first step 41, the (trained) neural network takes as input information about a change in the environment. This change may refer to, for example: a change in the weather, a change in occupancy of the environment, and/or the opening and closing of doors. The input signal can be obtained from a sensor of the HVAC unit.

[0159] Typically, a plurality of HVAC devices are connected to form an HVAC system that can be controlled by a controller. The controller may then send input to each of the HVAC units; for example, a desired change in conditions or a change in environmental conditions may be sent to the HVAC units. Similarly, a controller can contain a neural network, in which case the controller determines the appropriate action for each of the HVAC units and communicates these actions to the HVAC units. The controller may comprise a plurality of neural networks that are associated with different HVAC devices and/or groups of HVAC devices.

[0160] Similarly, a neural network may be present on the computing device of a single HVAC device.

[0161] In the second step 42, the neural network provides (e.g., recommends) an action based on the input. The action is determined based on the parameters of the neural network, which have been determined using the learning method described above, so that the neural network provides the appropriate action for the given input.

[0162] In the third step 43, the neural network implements the action via the agent 200. For example, the agent may change the operation of the HVAC device.

[0163] In a fourth step 44, the HVAC device and/or HVAC system provides output related to the activities. For example, an HVAC device may display energy consumption to a user and/or provide an audible alert indicating a deviation from a set of target conditions. Typically, each HVAC device and/or HVAC system is adapted to periodically output usage information related to the use of the HVAC devices in the environment.
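
Steps 41 to 44 describe a simple cycle of observing, deciding, acting and reporting. The following is a minimal sketch, assuming the trained network is available as a plain function from a sensor reading to an action label and that the device exposes hypothetical `apply` and `report` methods; none of these names, nor the energy figure, comes from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HvacDevice:
    """Hypothetical HVAC device with a heater that can be switched on or off."""
    heating: bool = False
    energy_used_kwh: float = 0.0
    log: List[str] = field(default_factory=list)

    def apply(self, action: str) -> None:
        self.heating = action == "heat_on"
        if self.heating:
            self.energy_used_kwh += 0.5  # assumed consumption per control step

    def report(self) -> str:
        message = f"heating={self.heating} energy={self.energy_used_kwh:.1f} kWh"
        self.log.append(message)
        return message


def control_step(policy: Callable[[float], str], device: HvacDevice,
                 sensor_temp: float) -> str:
    observation = sensor_temp     # step 41: receive the change in the environment
    action = policy(observation)  # step 42: the network recommends an action
    device.apply(action)          # step 43: the agent carries out the action
    return device.report()        # step 44: the device outputs usage information


def threshold_policy(temp: float) -> str:
    """Stand-in for the trained network: heat below 20 degrees C."""
    return "heat_on" if temp < 20.0 else "heat_off"
```

In a deployed system `threshold_policy` would be replaced by the trained actor network, and `report` by whatever display or alerting channel the device provides.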

[0164] In practice, using a reinforcement learning software system to control an HVAC device typically involves using one of the constituent neural networks (eg, global actor neural network 301) to control the HVAC device.

[0165] This neural network (e.g., global actor neural network 301) can be trained using the methods disclosed herein, and can then be transmitted to an HVAC device or to a controller for an HVAC system. This may include transferring the parameters of the trained global actor neural network to the HVAC device or controller. This trained neural network can then control the HVAC device depending on inputs received (e.g., measured environmental changes and/or signals received from the HVAC control system).
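
Transferring the trained network to a device or controller reduces, at its simplest, to serializing the parameter values on the training machine and loading them on the target. The sketch below uses the `json` module from the Python standard library. The parameter layout (a dictionary of named weight lists) and the function names are assumptions, since the disclosure does not specify a transfer format.

```python
import json
from typing import Dict, List

Params = Dict[str, List[float]]


def export_params(params: Params) -> str:
    """Serialize actor parameters on the training device."""
    return json.dumps(params, sort_keys=True)


def import_params(payload: str) -> Params:
    """Rebuild the parameters on the HVAC device or controller."""
    loaded = json.loads(payload)
    return {name: [float(v) for v in values] for name, values in loaded.items()}
```

A round trip through `export_params` and `import_params` should return the same values, which is a useful check before the training system is discarded.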

[0166] Therefore, typically, the reinforcement learning software system as a whole is present only during training, with global neural network 300 (and/or global actor neural network 301) being transferred from the training device to the HVAC device after the neural networks have converged and the training of the reinforcement learning system has been completed.

Alternatives and modifications:

[0167] It should be understood that the present invention has been described above by way of example only, and changes in detail may be made within the scope of the invention.

[0168] For example, training reinforcement learning system 110 may include one or more of the following: unsupervised deep learning and/or shallow learning, generative adversarial networks (GANs), variational autoencoders (VAEs). Such techniques may allow a once trained HVAC device model to be reused for similar HVAC devices with dissimilar action spaces. While reinforcement learning software typically uses reinforcement learning, other machine learning methods can be used instead of or in addition to reinforcement learning.

[0169] The detailed description primarily addressed the use of actor-critic systems. More generally, reinforcement learning system 110 may use any type of learning system, where the reinforcement learning system would normally still include a link between a global learning system and many child learning systems.

[0170] The type of machine learning systems used may be determined based on the environment and/or input datasets.

[0171] In some embodiments, the reward generation function used to train the actor-critic system is dynamic and/or selected from a variety of possible reward generation functions. For example, the reward generation function may depend on the rate of convergence and/or the difference in performance related to key metrics (for example, if the actor-critic system is close to optimal for the first of the key metrics, but performs poorly for the second key metric, then the reward generation function can be modified to prioritize the second key metric).
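
One way to picture a dynamic reward generation function is a weighted sum of key metrics whose weights shift toward whichever metric currently lags furthest behind its target. The weighting rule below is an illustrative assumption, not the disclosed function; the metric names and the `reweight` and `reward` helpers are hypothetical.

```python
from typing import Dict


def reweight(shortfalls: Dict[str, float]) -> Dict[str, float]:
    """Weights proportional to each metric's non-negative shortfall from its target."""
    total = sum(shortfalls.values())
    if total == 0.0:
        # Every metric is on target: weight them equally.
        return {name: 1.0 / len(shortfalls) for name in shortfalls}
    return {name: gap / total for name, gap in shortfalls.items()}


def reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-metric scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * scores[name] for name in scores)
```

If energy use is already on target but comfort falls short, all of the weight moves to comfort, so the next training steps are rewarded only for improving the lagging metric.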

[0172] In some embodiments, training reinforcement learning system 110 may include hyperparameter auto-correction.

[0173] In some embodiments, multiple agents are trained at the same time, and these agents have different reward generating functions. This may include training multiple agents for a single HVAC device or for a group of HVAC devices. These agents can then be activated and used depending on the desired set of conditions. For example, a first group of agents may be trained to perform optimally for a first set of target conditions, and a second group of agents may be trained to perform optimally for a second set of target conditions.

[0174] The environment is usually associated with a building. In some embodiments, the environment refers to a plurality of related buildings (e.g., a plurality of neighboring buildings). Reinforcement learning system 110 and agent 200 can then be trained to work together optimally (e.g., one building may be affected by turning on the air conditioning for another building).

[0175] In some implementations, game theory is used to determine the rational allocation of resources.

[0176] While the invention has been shown and described with reference to certain embodiments, it will be appreciated by those skilled in the art that various changes and modifications may be made therein without departing from the actual scope of the invention.

Claims

1. A control system for at least one heating, ventilation and air conditioning (HVAC) device, comprising:

- a neural network of the control system of the at least one HVAC device, configured to be trained by a training system;

- a training system for the neural network of the control system;

- a baseline simulator;

- a virtual environment module;

- a controller; and

- a memory storing instructions causing said neural network of the control system to be trained by the training system in accordance with steps including: a) obtaining an input data set; b) creating, by means of the virtual environment module, a virtual environment associated with one or more HVAC devices, based on the received input data set, at least one virtual model of the HVAC device, and at least one virtual model of the room in which said at least one virtual HVAC device model is located; c) simulating, by the baseline simulator, an operation mode of said at least one virtual HVAC device in said virtual room model based on the received input data set; d) obtaining a target metric based on the simulation performed; and e) training the neural network of the control system in accordance with the received target metric and in accordance with the input data set;

wherein the controller is configured to generate control instructions and transmit said instructions to at least one HVAC device, where said control instructions are generated based on the received target metric in accordance with step d).

2. The system according to claim 1, characterized in that the target metric includes at least one of the following indicators or combinations thereof: electricity and/or power consumption, cost of electricity and/or power, operating time of the HVAC unit, number of changes in operation of the HVAC unit, deviation from target conditions.

3. A system according to any one of claims 1 or 2, characterized in that the learning system comprises an actor-critic system comprising an actor neural network and a critic neural network.

4. The system according to any one of the preceding claims, characterized in that the learning system comprises a global actor-critic learning system and one or more child actor-critic learning systems, wherein:

- the global actor-critic system contains a global actor neural network and a global critic neural network;

- each of the one or more child actor-critic systems contains a child actor neural network and a child critic neural network.

5. The system according to claim 4, characterized in that the training of the neural network of the control system includes:

- transferring gradients from at least one child learning system to the global learning system and updating the parameters of the global learning system based on the gradients received from at least one child learning system;

- copying parameters from the global learning system to at least one child learning system; and

- repeating the steps of transferring gradients and copying parameters until each of the neural networks of the child system and the global system converges.

6. The system according to any of the preceding claims, characterized in that training the neural network of the control system includes determining the combination of at least two neural networks to provide a given mode of operation of at least one virtual HVAC device in accordance with said virtual environment.

7. The system according to any one of claims 4-6, characterized in that training the neural network of the control system includes training at least one child system using a separate copy of the virtual environment, where each of the copies of the virtual environment has different initialization properties, and the initialization properties include at least one of the following values, or a combination thereof: occupancy density in the room, different configurations of at least one virtual HVAC device, and/or different configurations of the virtual room model.

8. The system of claim 7, wherein the baseline simulator is configured to provide a separate target metric for at least one child system, wherein the separate target metrics for said child system are determined depending on the corresponding separate copy of the virtual environment.

9. The system according to any one of the preceding claims, characterized in that the neural network of the control system is trained depending on the set of target conditions provided by the baseline simulator, where the set of target conditions includes one or more of the following conditions: indoor and/or outdoor temperature, indoor and/or outdoor humidity.

10. The system according to any one of the preceding claims, characterized in that the neural network is trained depending on a plurality of target indicators and/or sets of target conditions provided by the baseline simulator, where each target metric is used by the learning system for a separate training step.

11. The system according to any one of the preceding claims, wherein training the neural network of the control system comprises creating a reinforcement learning agent configured to operate depending on the training system, wherein the agent is configured to interact with at least one virtual HVAC device in said virtual environment.

12. The system of claim 11, wherein the baseline simulator is configured to provide a plurality of targets and/or target conditions depending on the maximum number of actions that said agent can take during a predetermined learning period.

13. The system according to any of the preceding claims, characterized in that the baseline simulator is configured to provide a target metric depending on the input data set containing a time-limited data set, while the neural network of the control system is trained depending on the same time-limited data set.

14. The system according to any of the preceding claims, characterized in that the baseline simulator is configured to provide a target metric without using the neural network of the control system and/or to provide a target metric using one or more of the following methods: decision tree method, linear data transformation method, methods based on link analysis, methods using gradient boosting.

15. The system according to any of the preceding claims, characterized in that the baseline simulator is configured to provide a target metric depending on one of the following: user input to control the operation of at least one HVAC device in said virtual environment and/or stored data related to the operation of the virtual HVAC units.

16. The system according to any one of the preceding claims, characterized in that the input data set contains one or more of the following data: description and/or configuration data of HVAC devices in said virtual environment, building layout and/or zoning data, climatic conditions data.

17. The system according to any one of the preceding claims, characterized in that training the neural network of the control system includes providing a plurality of said agents, where each of the agents is configured to perform a plurality of actions and/or communicate with at least one HVAC device.

18. The system according to any one of the preceding claims, characterized in that training the control system neural network includes training a plurality of learning systems, where training of each learning system of the plurality of learning systems takes place in said environment for each virtual HVAC device and/or for each group of a set of groups of similar virtual HVAC devices.

19. The system according to any one of the preceding claims, characterized in that training the neural network of the control system includes training a plurality of learning systems associated with a plurality of said agents, each learning system being configured to interact with a respective agent.

20. The system according to claim 19, characterized in that the training of the neural network of the control system includes:

- training the first child system of the first learning system and the first child system of the second learning system depending on the first copy of said virtual environment;

- training a second child system of the first learning system and a second child system of the second learning system depending on the second copy of said virtual environment; wherein the first child system of the first learning system and the first child system of the second learning system are configured to interact with the first copy of said virtual environment simultaneously and/or the first child system of the first learning system and the first child system of the second learning system are configured to interact with the first copy of said virtual environment according to a predetermined condition.

21. A method for controlling at least one heating, ventilation and air conditioning (HVAC) device, comprising the steps of:

- obtaining an input data set;

- creating a virtual environment associated with one or more HVAC devices, based on the received input data set, at least one virtual model of the HVAC device, and at least one virtual model of the room in which said at least one virtual model of the HVAC device is located;

- performing an operation mode simulation of said at least one virtual HVAC device in said virtual room model based on the received input data set;

- obtaining a target metric based on the simulation performed;

- training the neural network of the control system of at least one HVAC device in accordance with the received target metric and in accordance with the input data set;

- generating control instructions by means of a controller; and

- transmitting said instructions to at least one HVAC device, where said control instructions are generated based on the received target metric.

22. A computer-readable medium that stores a computer program product containing instructions that, when executed by the processor, cause the processor to perform the method of claim 21.

23. A method for training a neural network of a control system for at least one heating, ventilation and air conditioning (HVAC) device, including the steps of:

- obtaining an input data set;

- creating, by means of a virtual environment module, a virtual environment associated with one or more HVAC devices, based on the received input data set, at least one virtual model of the HVAC device, and at least one virtual model of the room in which said at least one virtual HVAC device model is located;

- performing, by means of a baseline simulator, a simulation of the operation mode of said at least one virtual HVAC device in said virtual room model based on the received input data set;

- obtaining a target metric based on the simulation performed; and

- training the neural network of the control system in accordance with the received target metric and in accordance with the input data set.
