CN111505944B - Energy-saving control strategy learning method, and method and device for realizing air conditioning energy control - Google Patents

Energy-saving control strategy learning method, and method and device for realizing air conditioning energy control

Info

Publication number
CN111505944B
CN111505944B (application CN201910091191.XA)
Authority
CN
China
Prior art keywords: value, action, state, energy, reward
Prior art date
Legal status
Active
Application number
CN201910091191.XA
Other languages
Chinese (zh)
Other versions
CN111505944A (en)
Inventor
谭建明
李绍斌
宋德超
陈翀
罗晓宇
邓家璧
王鹏飞
肖文轩
岳冬
Current Assignee
Zhuhai Lianyun Technology Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Priority date
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN201910091191.XA
Publication of CN111505944A
Application granted
Publication of CN111505944B

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B15/00Systems controlled by a computer
    • G05B15/02Systems controlled by a computer electric
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00Programme-control systems
    • G05B19/02Programme-control systems electric
    • G05B19/418Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS], computer integrated manufacturing [CIM]
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/20Pc systems
    • G05B2219/26Pc applications
    • G05B2219/2642Domotique, domestic, home control, automation, smart house

Abstract

The invention provides an energy-saving control strategy learning method, and a method and a device for realizing air conditioning energy control. The learning method combines the Monte Carlo method with reinforcement learning: it obtains an approximate solution by Monte Carlo sampling, executes a selected action on the current air conditioning environment, observes the resulting state transition and reward, estimates the return of each state as the sample mean of its observed returns, and finally obtains an optimal control strategy. The invention also provides a method for realizing air conditioning energy control based on this learning method. Through continuous interactive learning of the air conditioner's operating environment, the invention searches for the optimal control strategy, thereby achieving energy-saving control.

Description

Energy-saving control strategy learning method, and method and device for realizing air conditioning energy control
Technical Field
The invention relates to the technical field of smart homes, in particular to an energy-saving control strategy learning method, and a method and a device for realizing air conditioning energy control.
Background
With the rapid development of science and technology, people are no longer satisfied with existing living conditions and increasingly pursue more comfortable living environments. As living standards improve, the air conditioner has become a necessary household appliance for more and more families. However, air conditioners consume a large amount of electricity, which is a serious concern for both consumers and manufacturers. In addition, existing air conditioner control methods mainly regulate temperature, and the complexity of the operating environment makes energy-saving control difficult to realize.
Disclosure of Invention
The invention provides an energy-saving control strategy learning method, and a method and a device for realizing air conditioning energy control, which search for an optimal control strategy through continuous interactive learning of the air conditioner's operating environment, thereby achieving energy-saving control of the air conditioner.
In a first aspect of the present invention, an energy saving control strategy learning method is provided, including:
s11, acquiring initial state parameters of the air conditioner, and determining an initial action value according to the initial state parameters;
s12, executing a control action corresponding to the initial action value, acquiring a target state parameter of the next state of the air conditioner and a generated energy-saving reward value after the control action is executed, and updating a sampling count value;
s13, searching a preset reward table according to the target state parameter to obtain a historical return value of a state action pair formed by the target state parameter and different preset action values, wherein the reward table comprises an energy-saving reward value and a historical return value of the state action pair formed by the state parameter and different preset action values;
s14, selecting a target action value in a state action pair formed by the target state parameters, wherein the probability that the state action pair corresponding to the target action value is the state action pair with the maximum historical return value in the formed state action pair is larger than a preset value;
s15, executing the control action corresponding to the target action value, and acquiring the generated target energy-saving reward value after the control action is executed;
s16, judging whether the sampling count value reaches a preset sampling threshold value;
if the sampling count value does not reach the preset sampling threshold value, repeatedly executing S12-S16, otherwise executing S17;
and S17, respectively counting the sampling mean value of the target energy-saving reward value of each state action pair formed by the target state parameters, taking the obtained sampling mean value as the estimated reward value of the corresponding state action pair, and updating the reward table according to the estimated reward value.
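One sampling round of the steps S11-S17 above can be sketched in Python. Everything below is a toy illustration, not the patent's actual air-conditioner interface: the environment, state discretization, action set, reward shape, and the ε-greedy form of the soft selection are all assumptions for the sketch.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1, 2]     # hypothetical preset action values (e.g. compressor speed levels)
N_SAMPLES = 50          # preset sampling threshold N

class ToyAcEnv:
    """Toy stand-in for the air conditioning environment: the state is a
    discretized room temperature."""
    def __init__(self):
        self.temp = 30
    def state(self):
        return self.temp
    def execute(self, action):
        # Larger actions cool faster but use more energy.
        self.temp = max(20, self.temp - action)
        comfort = -abs(self.temp - 25)   # closer to 25 C is more comfortable
        energy = -action                  # larger actions are penalized
        return self.temp, comfort + energy  # next state, energy-saving reward

def sampling_round(env, q_table, returns, epsilon=0.1, rng=random.Random(0)):
    """S12-S17: act, observe state and reward, count samples, then
    replace each pair's return estimate with its sample mean."""
    n = 0                                               # sampling count value
    while n < N_SAMPLES:                                # S16: check the counter
        state = env.state()
        if rng.random() < epsilon:                      # S14: soft selection -
            action = rng.choice(ACTIONS)                # occasionally random,
        else:                                           # mostly greedy on Q
            action = max(ACTIONS, key=lambda a: q_table[(state, a)])
        _, reward = env.execute(action)                 # S12/S15: act, observe
        returns[(state, action)].append(reward)
        n += 1
    for sa, samples in returns.items():                 # S17: sample means
        q_table[sa] = sum(samples) / len(samples)

q = defaultdict(float)   # unknown state-action pairs default to a return of 0
sampling_round(ToyAcEnv(), q, defaultdict(list))
```

Because every reward here is a sum of non-positive comfort and energy penalties, all learned return estimates are at most zero; the outer iteration of S18 would simply repeat this round with the counter reset.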
Optionally, after updating the reward table according to the estimated reward value, the method further comprises:
s18, updating an iteration count value, and judging whether the iteration count value reaches a preset iteration threshold value;
if the iteration count value does not reach the preset iteration threshold value, resetting the sampling count value, and repeatedly executing S12-S17, otherwise, ending the learning process.
Optionally, the selecting a target action value in a state action pair formed by the target state parameters includes:
and selecting the target action value from the state action pairs formed by the target state parameters by adopting a soft decision algorithm.
Optionally, the determining an initial action value according to the initial state parameter includes:
searching the reward table according to the initial state parameter;
and if the state action pair formed by the initial state parameters does not exist in the reward table, taking a preset default action value as the initial action value.
Optionally, the method further comprises:
if a state action pair formed by the initial state parameters exists in the reward table, acquiring historical return values of the state action pair formed by the initial state parameters and different preset action values;
and selecting the action value of the state action pair with the maximum historical return value among the state action pairs formed by the initial state parameters, and taking the selected action value as the initial action value.
In a second aspect of the present invention, there is provided a method for implementing air conditioning energy control based on the energy saving control strategy learning method described above, including:
acquiring current state parameters of the air conditioner;
searching a reward table learned by the energy-saving control strategy learning method according to the current state parameter to obtain a historical return value of a state action pair formed by the current state parameter and different preset action values;
selecting an action value of a state action pair with the maximum historical return value among the state action pairs formed by the current state parameters, and taking the selected action value as an optimal action value;
and executing the control action corresponding to the optimal action value to realize the energy-saving control of the air conditioner.
In a third aspect of the present invention, there is provided an energy-saving control strategy learning apparatus, including:
the first decision module is used for acquiring initial state parameters of the air conditioner and determining an initial action value according to the initial state parameters;
the first execution module is used for executing the control action corresponding to the initial action value, acquiring the target state parameter of the next state of the air conditioner and the generated energy-saving reward value after the control action is executed, and updating a sampling count value;
the processing module is used for searching a preset reward table according to the target state parameter so as to obtain a historical return value of a state action pair formed by the target state parameter and different preset action values, and the reward table comprises an energy-saving reward value and a historical return value of the state action pair formed by the state parameter and different preset action values;
a second decision module, configured to select a target action value from a state action pair formed by the target state parameters, where a probability that a state action pair corresponding to the target action value is a state action pair with a largest historical return value in the formed state action pair is greater than a preset value;
the second execution module is used for executing the control action corresponding to the target action value and acquiring the generated target energy-saving reward value after the control action is executed;
the first judging module is used for judging whether the sampling count value reaches a preset sampling threshold value or not, and if the sampling count value does not reach the preset sampling threshold value, returning to the first executing module;
and the learning module is used for respectively counting, when the judgment result of the first judging module is that the sampling count value reaches the preset sampling threshold value, the sampling mean value of the target energy-saving reward value of each state action pair formed by the target state parameters, taking the obtained sampling mean value as the estimated reward value of the corresponding state action pair, and updating the reward table according to the estimated reward value.
Optionally, the learning module is further configured to update an iteration count value after updating the reward table according to the estimated reward value;
the device further comprises:
the second judgment module is used for judging whether the iteration count value reaches a preset iteration threshold value;
the learning module is further configured to reset the sampling count value when the iteration count value does not reach a preset iteration threshold value, return to the first execution module, and end the learning process when the iteration count value reaches the preset iteration threshold value.
Optionally, the first decision module is specifically configured to search the bonus table according to the initial state parameter; if the state action pair formed by the initial state parameters does not exist in the reward table, taking a preset default action value as the initial action value; if the state action pair formed by the initial state parameters exists in the reward table, acquiring a historical return value of the state action pair formed by the initial state parameters and different preset action values, selecting an action value of the state action pair with the maximum historical return value among the state action pairs formed by the initial state parameters, and taking the selected action value as the initial action value.
A fourth aspect of the present invention provides an apparatus for implementing the air conditioning energy control based on the energy saving control strategy learning apparatus as described above, including:
the parameter acquisition module is used for acquiring current state parameters of the air conditioner;
the second processing module is used for searching a reward table learned by the energy-saving control strategy learning device according to the current state parameter so as to obtain a historical return value of a state action pair formed by the current state parameter and different preset action values;
a third decision module, configured to select an action value of a state action pair with a largest historical return value among state action pairs formed by the current state parameter, and use the selected action value as an optimal action value;
and the third execution module is used for executing the control action corresponding to the optimal action value to realize the energy-saving control of the air conditioner.
Furthermore, the invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as set forth in any of the above.
The invention also provides an air conditioning device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of any one of the above methods.
The energy-saving control strategy learning method, and the method and device for realizing air conditioning energy control, combine the Monte Carlo method with reinforcement learning. The Monte Carlo sampling method is used to obtain an approximate solution: a selected action is executed on the current air conditioning environment, the resulting state transition and reward are observed, and through continuous interactive learning of the air conditioner's operating environment the return of each state is estimated as the sample mean of its observed returns, finally yielding an optimal control strategy that achieves energy-saving control.
The foregoing is only an overview of the technical solutions of the present invention. Embodiments of the present invention are described below so that the technical means of the invention, and its above and other objects, features, and advantages, can be more clearly understood.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart of a method for learning an energy-saving control strategy according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a method for learning an energy saving control strategy according to another embodiment of the present invention;
fig. 3 is a schematic flowchart of a method for implementing air conditioning energy control based on an energy-saving control strategy learning method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an energy-saving control strategy learning apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a device for implementing air conditioning energy control based on an energy-saving control strategy learning device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Reinforcement learning is receiving increasing attention in the field of artificial intelligence, with applications including industrial scheduling and path planning, and can be used to solve decision-making problems in the optimization of stochastic or uncertain dynamic systems. It has outstanding significance and broad prospects in both theory and practice. The invention uses a reinforcement learning framework to control the air conditioner. Reinforcement learning differs from other approaches (such as supervised neural networks) in that its learning target changes and is not predefined; absolutely correct labels may not even exist. The air conditioner control environment is complex, and its control targets are numerous and change dynamically with the environment. Therefore, reinforcement learning is used to learn the energy-saving control strategy: the algorithm continuously and interactively learns from the air conditioner's operating environment, searches for the optimal control strategy, and thereby achieves energy-saving control.
Fig. 1 schematically shows a flowchart of an energy saving control strategy learning method according to an embodiment of the present invention. Referring to fig. 1, the energy-saving control strategy learning method provided in the embodiment of the present invention specifically includes steps S11 to S17, as follows:
and S11, acquiring initial state parameters of the air conditioner, and determining an initial action value according to the initial state parameters.
Specifically, the state parameters of the air conditioner represent its current operating environment, including the inner-pipe temperature, the indoor temperature, the outdoor ambient temperature, and the like.
And S12, executing a control action corresponding to the initial action value, acquiring a target state parameter of the next state of the air conditioner and a generated energy-saving reward value after the control action is executed, and updating a sampling count value n.
In this embodiment, after the control action is executed, the target state parameter and the energy-saving reward value of the next state are acquired and stored in a preset reward table.
The energy-saving reward value is a composite index of comfort and energy saving fed back after the air conditioner is operated with a specific action, and is used to guide the algorithm in adjusting the return value, i.e., the Q value of the state-action value function. That is, after an action is executed, the air conditioner returns relevant data from which the quality of that action is calculated, guiding the algorithm to output more suitable actions and thereby achieve energy-saving control.
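The patent states only that the reward combines comfort and energy saving; one minimal way to express such a composite index is a weighted penalty, where the weights, the power measurement, and the function itself are illustrative assumptions:

```python
def eco_reward(indoor_temp, setpoint, power_watts,
               comfort_weight=1.0, energy_weight=0.01):
    """Hypothetical composite energy-saving reward: penalize both the
    deviation from the comfort setpoint and the electrical power drawn.
    The weighting scheme is an assumption, not the patent's formula."""
    comfort_penalty = abs(indoor_temp - setpoint)
    return -(comfort_weight * comfort_penalty + energy_weight * power_watts)
```

Under this sketch a perfectly comfortable, zero-power state scores 0, and every degree of discomfort or watt of consumption pushes the reward further negative, which is the shape the Q-value guidance described above needs.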
S13, searching a preset reward table according to the target state parameter to obtain a historical return value of a state action pair formed by the target state parameter and different preset action values, wherein the reward table comprises an energy-saving reward value and a historical return value of the state action pair formed by the state parameter and different preset action values.
Specifically, the action value represents a control parameter of the air conditioner, such as the compressor rotation speed, the electronic expansion valve opening degree, or a combination thereof. In this embodiment, different action values are preset.
S14, selecting a target action value in a state action pair formed by the target state parameters, wherein the probability that the state action pair corresponding to the target action value is the state action pair with the maximum historical return value in the formed state action pair is larger than a preset value.
In this embodiment, a soft decision algorithm may be adopted to select the target action value from the state action pairs formed by the target state parameters. Specifically, the optimal action, selected through the state-action value function Q, is the action value of the state action pair with the largest historical return value among those pairs, while a random action is an action value selected at random from the selectable state action pairs.
And S15, executing the control action corresponding to the target action value, and acquiring the generated target energy-saving reward value after the control action is executed.
And S16, judging whether the sampling count value n reaches a preset sampling threshold value N.
If the sampling count value n does not reach the preset sampling threshold value N, S12-S16 are repeated; otherwise S17 is executed;
where N denotes the length of one Monte Carlo sampling round, n denotes the sampling count value of the counter, and n starts counting from 1 in each round.
And S17, respectively counting the sampling mean value of the target energy-saving reward value of each state action pair formed by the target state parameters, taking the obtained sampling mean value as the estimated reward value of the corresponding state action pair, and updating the reward table according to the estimated reward value.
In this embodiment, the mean of the accumulated reward samples of each state-action pair is computed from the reward table and used as the estimated reward value of the corresponding state-action pair; the reward table is then updated so that the learned estimate replaces the historical return value of that pair.
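The S17 update reduces to replacing each pair's stored return with the sample mean of its accumulated rewards. A minimal sketch, assuming a `(state, action) -> return` table and a parallel map of accumulated reward samples:

```python
def update_reward_table(q_table, returns):
    """S17: for every state-action pair with accumulated samples, store
    the sample mean as the pair's new estimated return. The table and
    sample-map layouts are illustrative assumptions."""
    for sa, samples in returns.items():
        if samples:                       # skip pairs never sampled
            q_table[sa] = sum(samples) / len(samples)
    return q_table
```

After this call the historical return values read by S13 in the next round are the freshly learned estimates.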
Further, after the control action corresponding to the initial action value is executed in step S12, if the air conditioner is abnormal (for example, the compressor shuts down), the method records that the air conditioning environment parameters are abnormal and returns to step S12; otherwise, it proceeds to step S13 and continues the subsequent process.
The energy-saving control strategy learning method, and the method and device for realizing air conditioning energy control, combine the Monte Carlo method with reinforcement learning. The Monte Carlo sampling method is used to obtain an approximate solution: a selected action is executed on the current air conditioning environment, the resulting state transition and reward are observed, and through continuous interactive learning of the air conditioner's operating environment the return of each state is estimated as the sample mean of its observed returns, finally yielding an optimal control strategy that achieves energy-saving control.
In another embodiment of the present invention, referring to fig. 2, after updating the bonus table according to the estimated reward value, the method further comprises step S18:
and S18, updating an iteration count value, judging whether the iteration count value reaches a preset iteration threshold value, if the iteration count value does not reach the preset iteration threshold value, resetting the sampling count value, namely setting N to 0, and repeatedly executing S12-S17 until the repeated execution is carried out for N times, otherwise, ending the learning process.
In this embodiment, by determining whether the updated iteration count value satisfies the preset iteration threshold, if so, the process is ended; if the execution is not satisfied, the process is repeatedly executed from S12 to S17, and the learning is continued.
The energy-saving control strategy learning method provided by the invention does not need to rely on a known Markov process model: when the model is unknown, it can select states of interest for value-function evaluation without traversing the value function over all states.
In the embodiment of the invention, the algorithm selects a corresponding action at each step; after the air conditioner executes the action, it feeds back relevant data used to update the learned values and select the next action. The Q value is thus continuously updated in an iterative manner, and learning of the model stops when the maximum number of iterations is reached. The air conditioner then performs energy-saving optimization control by executing, according to its current state, the action estimated from the reward table. In this way the algorithm learns the optimal control strategy for the specific spatial environment of the air conditioner, i.e., adaptive energy-saving control.
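The continuous iterative Q update described above can also be written in incremental form, which gives the same result as recomputing the sample mean from all accumulated returns without storing them. The function below is an illustrative sketch, not the patent's stated formula:

```python
def incremental_mean(q, count, new_return):
    """One iterative Q update: fold a newly observed return G into the
    running sample mean, Q_{k+1} = Q_k + (G - Q_k) / (k + 1). This is
    algebraically identical to the batch sample mean over all returns."""
    count += 1
    q += (new_return - q) / count
    return q, count
```

Repeated calls converge the estimate toward the true mean return of the state-action pair, which is why stopping at a maximum iteration count yields a usable reward table.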
In an embodiment of the present invention, the determining an initial action value according to the initial state parameter specifically includes:
searching the reward table according to the initial state parameter;
and if the state action pair formed by the initial state parameters does not exist in the reward table, taking a preset default action value as the initial action value.
Further, if a state action pair formed by the initial state parameters exists in the reward table, acquiring a historical return value of the state action pair formed by the initial state parameters and different preset action values; and selecting the action value of the state action pair with the maximum historical return value among the state action pairs formed by the initial state parameters, and taking the selected action value as the initial action value.
In this embodiment, if the state action pair formed by the initial state parameters does not exist in the reward table, learning for that state starts from scratch, and a preset default action value is used as the initial action value. If the state action pair does exist in the reward table, a learning record already exists and learning continues on that basis: the action value of the state action pair with the largest historical return value among those formed by the initial state parameters can be selected directly as the initial action value, and subsequent learning continues from there.
In this embodiment, the historical return value of an unknown state action pair defaults to 0.
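The default-to-zero rule for unknown pairs maps directly onto a `defaultdict`; a minimal sketch, where the `(state, action) -> return` table layout is an illustrative assumption:

```python
from collections import defaultdict

# Unknown state-action pairs read as a historical return of 0, as the
# embodiment specifies; defaultdict(float) provides exactly that.
reward_table = defaultdict(float)

def historical_return(table, state, action):
    """Look up a pair's historical return; unseen pairs yield 0.0."""
    return table[(state, action)]
```

This also means the greedy lookup never fails on a state the learner has not yet visited: every candidate action simply ties at zero.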
The energy-saving control strategy learning method adopts a Monte Carlo reinforcement learning algorithm to realize energy-saving control strategy learning. Monte Carlo reinforcement learning evaluates strategies based on sampling, so the algorithm can better handle control of the air conditioner's operating environment when the model is unknown. Through continuous learning, the algorithm finds a more energy-saving control strategy while keeping the air conditioner's operation within the set conditions. This addresses the high energy consumption of existing air conditioners, controls the air conditioner's operating environment comprehensively, and solves the simulation-based optimization of the air conditioner's execution parameters.
Fig. 3 schematically shows a flowchart of a method for implementing the air conditioning energy control based on the energy-saving control strategy learning method according to an embodiment of the present invention. Referring to fig. 3, the method for implementing air conditioning energy control based on the energy-saving control strategy learning method provided in the embodiment of the present invention specifically includes steps S21 to S24, as follows:
and S21, acquiring the current state parameters of the air conditioner.
S22, searching the reward table learned by the energy-saving control strategy learning method according to the current state parameter to obtain the historical return value of the state action pair formed by the current state parameter and different preset action values.
And S23, selecting the action value of the state action pair with the maximum historical return value among the state action pairs formed by the current state parameters, and taking the selected action value as the optimal action value.
And S24, executing the control action corresponding to the optimal action value to realize the energy-saving control of the air conditioner.
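The control-time flow of S21-S24 is a pure greedy lookup over the learned reward table. The sketch below assumes the same hypothetical `(state, action) -> return` table layout as earlier; the helper name is illustrative:

```python
def optimal_action(reward_table, state, preset_actions):
    """S22-S23: search the learned reward table for the current state
    and select the preset action with the largest historical return
    (unseen pairs default to 0). S24 then executes the control action
    corresponding to the returned value."""
    return max(preset_actions,
               key=lambda a: reward_table.get((state, a), 0.0))
```

Unlike the soft selection used during learning, no random exploration is needed here: the table already encodes the learned policy, so the controller always exploits it.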
The invention uses the idea of reinforcement learning to let the air conditioner learn continuously through dynamic adjustment, thereby finding the optimal control strategy suited to the environment in which the air conditioner is located.
A general reinforcement learning method needs the state transition probabilities of the air-conditioning environment, and their distribution must conform to a finite Markov process; air conditioner control cannot satisfy these requirements well. Therefore, a Monte Carlo-based reinforcement learning method is used. The Monte Carlo method imposes no such requirement: by continuously collecting samples it estimates the probability distribution of the states, finds a better control strategy, and achieves energy-saving control.
For simplicity of explanation, the method embodiments are described as a series of acts or a combination of acts, but those skilled in the art will appreciate that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the embodiments of the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that not every act described is required to implement the invention.
Fig. 4 schematically shows a structural diagram of an energy saving control strategy learning apparatus according to an embodiment of the present invention. Referring to fig. 4, the energy saving control strategy learning apparatus according to the embodiment of the present invention specifically includes a first decision module 401, a first execution module 402, a processing module 403, a second decision module 404, a second execution module 405, a first judgment module 406, and a learning module 407, where:
a first decision module 401, configured to obtain an initial state parameter of an air conditioner, and determine an initial action value according to the initial state parameter;
a first executing module 402, configured to execute a control action corresponding to the initial action value, obtain a target state parameter of a next state of the air conditioner and a generated energy saving reward value after the control action is executed, and update a sampling count value;
a processing module 403, configured to search a preset reward table according to the target state parameter to obtain a historical return value of a state action pair formed by the target state parameter and different preset action values, where the reward table includes an energy-saving reward value and a historical return value of the state action pair formed by the state parameter and different preset action values;
a second decision module 404, configured to select a target action value in a state action pair formed by the target state parameters, where a probability that a state action pair corresponding to the target action value is a state action pair with a largest historical return value in the formed state action pair is greater than a preset value;
a second executing module 405, configured to execute a control action corresponding to the target action value, and obtain a generated target energy saving reward value after the control action is executed;
a first determining module 406, configured to determine whether the sampling count value reaches a preset sampling threshold, and if the sampling count value does not reach the preset sampling threshold, return to the first executing module;
a learning module 407, configured to, when the determination result of the first determining module 406 is that the sampling count value reaches the preset sampling threshold, respectively count the sampling mean of the target energy saving reward values of each state action pair formed by the target state parameters, use the obtained sampling means as the estimated reward values of the corresponding state action pairs, and update the reward table according to the estimated reward values.
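A minimal sketch of the learning module's update, under assumed data structures: rewards observed for each (state, action) pair during the sampling window are averaged, and the sampling mean is written back into the reward table as that pair's estimated value.

```python
from collections import defaultdict

def update_reward_table(reward_table, samples):
    """samples: list of ((state, action), reward) observations collected
    until the sampling count reached its threshold (hypothetical format)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pair, reward in samples:
        sums[pair] += reward
        counts[pair] += 1
    for pair in sums:
        # The sampling mean serves as the estimated value of the pair.
        reward_table[pair] = sums[pair] / counts[pair]
    return reward_table

table = update_reward_table({}, [
    (("s1", "a1"), 2.0),
    (("s1", "a1"), 4.0),
    (("s1", "a2"), 1.0),
])
```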
In an embodiment of the present invention, the learning module 407 is further configured to update an iteration count value after updating the bonus table according to the estimated reward value;
correspondingly, the apparatus further includes a second determining module, not shown in the drawings, configured to determine whether the iteration count value reaches a preset iteration threshold value;
the learning module 407 is further configured to reset the sampling count value when the iteration count value does not reach a preset iteration threshold, and return to the first execution module, and when the iteration count value reaches the preset iteration threshold, end the learning process.
The second decision module 404 is specifically configured to select the target action value in the state action pair formed by the target state parameters by using a soft decision algorithm.
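One common reading of such a soft decision rule is epsilon-greedy selection, sketched below as an assumed interpretation (the patent does not specify the algorithm): with probability 1 - epsilon the action with the largest historical return value is chosen, so the greedy state action pair is selected with probability greater than a preset value, while random exploration keeps every preset action reachable.

```python
import random

def soft_select(reward_table, state, preset_actions, epsilon=0.1):
    """Epsilon-greedy selection: the greedy pair is chosen with probability
    at least 1 - epsilon, while random exploration keeps every action
    reachable. An assumed reading of the 'soft decision' algorithm."""
    if random.random() < epsilon:
        return random.choice(preset_actions)  # explore
    return max(preset_actions,
               key=lambda a: reward_table.get((state, a), float("-inf")))  # exploit

random.seed(1)  # deterministic for illustration
table = {("s", "a1"): 1.0, ("s", "a2"): 9.0}
picks = [soft_select(table, "s", ["a1", "a2"]) for _ in range(200)]
# "a2" dominates, but "a1" still occasionally appears through exploration
```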
In this embodiment of the present invention, the first decision module 401 is specifically configured to search the bonus table according to the initial state parameter; and if the state action pair formed by the initial state parameters does not exist in the reward table, taking a preset default action value as the initial action value.
Further, the first decision module 401 is specifically configured to, if a state action pair formed by the initial state parameter exists in the reward table, obtain a historical return value of the state action pair formed by the initial state parameter and different preset action values, select an action value of the state action pair with the largest historical return value among the state action pairs formed by the initial state parameter, and use the selected action value as the initial action value.
Fig. 5 is a schematic structural diagram of an apparatus for implementing the air conditioning energy control based on the energy-saving control strategy learning apparatus according to an embodiment of the present invention. Referring to fig. 5, the apparatus for implementing air conditioning energy control based on an energy-saving control strategy learning apparatus in the embodiment of the present invention specifically includes a parameter acquisition module 501, a second processing module 502, a third decision module 503, and a third execution module 504, where:
a parameter acquisition module 501, configured to acquire current state parameters of the air conditioner;
a second processing module 502, configured to search, according to the current state parameter, a reward table learned by using the energy-saving control policy learning apparatus, so as to obtain a historical return value of a state action pair formed by the current state parameter and different preset action values;
a third decision module 503, configured to select an action value of a state action pair with a largest historical return value among the state action pairs formed by the current state parameter, and use the selected action value as an optimal action value;
and a third executing module 504, configured to execute the control action corresponding to the optimal action value, so as to implement energy saving control of the air conditioner.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The energy-saving control strategy learning method, the method for realizing air conditioning energy control, and the devices thereof combine the Monte Carlo method with reinforcement learning. The Monte Carlo sampling method is used to obtain an approximate solution: the selected action is executed on the current air-conditioning environment, and the resulting state transition and reward are observed. Through continuous interactive learning in the air-conditioning operating environment, the return value of each state is estimated from the sample average of its returns, and finally the optimal control strategy is obtained to achieve energy-saving control.
Furthermore, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements the steps of the method according to any of the above embodiments.
In this embodiment, the module/unit integrated with the air-conditioning energy-saving control device or the device for realizing air-conditioning energy control based on the energy-saving control strategy learning device may be stored in a computer-readable storage medium if it is realized in the form of a software functional unit and sold or used as an independent product. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The air conditioning equipment provided by the embodiment of the invention comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the computer program to realize the steps in the energy-saving control strategy learning method embodiment or realize the steps in the method embodiment of the air conditioning energy control based on the energy-saving control strategy learning method. Alternatively, when the processor executes the computer program, the processor implements the functions of each module/unit in the energy saving control policy learning apparatus embodiment, for example, the first decision module 401, the first execution module 402, the processing module 403, the second decision module 404, the second execution module 405, the first judgment module 406, and the learning module 407 shown in fig. 4, or implements the functions of each module/unit in the apparatus embodiment that implements the air conditioning energy control based on the energy saving control policy learning apparatus, for example, the parameter collection module 501, the second processing module 502, the third decision module 503, and the third execution module 504 shown in fig. 5.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the air-conditioning energy-saving control apparatus or the execution process in the apparatus for implementing air-conditioning energy control based on the energy-saving control strategy learning apparatus.
The air conditioning equipment can be computing equipment such as a desktop computer, a notebook computer, a palm computer and a cloud server. The air conditioning device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the air conditioning apparatus in this embodiment may include more or fewer components, or combine some components, or have different components; for example, the air conditioning apparatus may further include an input-output device, a network access device, a bus, etc.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, the processor being the control center of the air conditioning apparatus and connecting the various parts of the entire air conditioning apparatus with various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the air conditioning equipment by operating or executing the computer programs and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
Those skilled in the art will appreciate that, while some embodiments herein include some features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. An energy-saving control strategy learning method is characterized by comprising the following steps:
s11, acquiring initial state parameters of the air conditioner, and determining an initial action value according to the initial state parameters;
s12, executing a control action corresponding to the initial action value, acquiring a target state parameter of the next state of the air conditioner and a generated energy-saving reward value after the control action is executed, and updating a sampling count value;
s13, searching a preset reward table according to the target state parameter to obtain a historical return value of a state action pair formed by the target state parameter and different preset action values, wherein the reward table comprises an energy-saving reward value and a historical return value of the state action pair formed by the target state parameter and different preset action values;
s14, selecting a target action value in a state action pair formed by the target state parameters, wherein the probability that the state action pair corresponding to the target action value is the state action pair with the maximum historical return value in the formed state action pair is larger than a preset value;
s15, executing the control action corresponding to the target action value, and acquiring the generated target energy-saving reward value after the control action is executed;
s16, judging whether the sampling count value reaches a preset sampling threshold value;
if the sampling count value does not reach the preset sampling threshold value, repeatedly executing S12-S16, otherwise executing S17;
and S17, respectively counting the sampling mean value of the target energy-saving reward value of each state action pair formed by the target state parameters, taking the obtained sampling mean value as the estimated reward value of the corresponding state action pair, and updating the reward table according to the estimated reward value.
2. The method of claim 1, wherein after updating the rewards table according to the estimated reward value, the method further comprises:
s18, updating an iteration count value, and judging whether the iteration count value reaches a preset iteration threshold value;
if the iteration count value does not reach the preset iteration threshold value, resetting the sampling count value, and repeatedly executing S12-S17, otherwise, ending the learning process.
3. The method of claim 1, wherein selecting a target action value in a state action pair formed by the target state parameters comprises:
and selecting the target action value in the state action pair formed by the target state parameters by adopting a soft decision algorithm.
4. The method according to any of claims 1-3, wherein said determining an initial action value from said initial state parameter comprises:
searching the reward table according to the initial state parameter;
and if the state action pair formed by the initial state parameters does not exist in the reward table, taking a preset default action value as the initial action value.
5. The method of claim 4, further comprising:
if a state action pair formed by the initial state parameters exists in the reward table, acquiring historical return values of the state action pair formed by the initial state parameters and different preset action values;
and selecting the action value of the state action pair with the maximum historical return value among the state action pairs formed by the initial state parameters, and taking the selected action value as the initial action value.
6. A method for implementing the air conditioning energy control based on the energy saving control strategy learning method according to any one of claims 1 to 5, comprising:
acquiring current state parameters of the air conditioner;
searching a reward table learned by the energy-saving control strategy learning method according to the current state parameter to obtain a historical return value of a state action pair formed by the current state parameter and different preset action values;
selecting an action value of a state action pair with the maximum historical return value among the state action pairs formed by the current state parameters, and taking the selected action value as an optimal action value;
and executing the control action corresponding to the optimal action value to realize the energy-saving control of the air conditioner.
7. An energy-saving control strategy learning device, comprising:
the first decision module is used for acquiring initial state parameters of the air conditioner and determining an initial action value according to the initial state parameters;
the first execution module is used for executing the control action corresponding to the initial action value, acquiring the target state parameter of the next state of the air conditioner and the generated energy-saving reward value after the control action is executed, and updating a sampling count value;
the processing module is used for searching a preset reward table according to the target state parameter so as to obtain a historical reward value of a state action pair formed by the target state parameter and different preset action values, and the reward table comprises an energy-saving reward value and a historical reward value of the state action pair formed by the target state parameter and different preset action values;
a second decision module, configured to select a target action value from a state action pair formed by the target state parameters, where a probability that a state action pair corresponding to the target action value is a state action pair with a largest historical return value in the formed state action pair is greater than a preset value;
the second execution module is used for executing the control action corresponding to the target action value and acquiring the generated target energy-saving reward value after the control action is executed;
the first judging module is used for judging whether the sampling count value reaches a preset sampling threshold value or not, and if the sampling count value does not reach the preset sampling threshold value, returning to the first executing module;
and the learning module is used for respectively counting the sampling mean value of the target energy-saving reward value of each state action pair formed by the target state parameters when the judgment result of the judgment module is that the sampling count value reaches a preset sampling threshold value, taking the obtained sampling mean value as the estimated reward value of the corresponding state action pair, and updating the reward table according to the estimated reward value.
8. The apparatus of claim 7, wherein the learning module is further configured to update an iteration count value after updating the rewards table based on the estimated reward value;
the device further comprises:
the second judgment module is used for judging whether the iteration count value reaches a preset iteration threshold value;
the learning module is further configured to reset the sampling count value when the iteration count value does not reach a preset iteration threshold value, return to the first execution module, and end the learning process when the iteration count value reaches the preset iteration threshold value.
9. The apparatus according to claim 7 or 8, wherein the first decision module is specifically configured to look up the bonus table according to the initial state parameter; if the state action pair formed by the initial state parameters does not exist in the reward table, taking a preset default action value as the initial action value; if the state action pair formed by the initial state parameters exists in the reward table, acquiring a historical return value of the state action pair formed by the initial state parameters and different preset action values, selecting an action value of the state action pair with the maximum historical return value among the state action pairs formed by the initial state parameters, and taking the selected action value as the initial action value.
10. An apparatus for implementing the air conditioning energy control based on the energy-saving control strategy learning apparatus according to any one of claims 7 to 9, comprising:
the parameter acquisition module is used for acquiring current state parameters of the air conditioner;
the second processing module is used for searching a reward table learned by the energy-saving control strategy learning device according to the current state parameter so as to obtain a historical return value of a state action pair formed by the current state parameter and different preset action values;
a third decision module, configured to select an action value of a state action pair with a largest historical return value among state action pairs formed by the current state parameter, and use the selected action value as an optimal action value;
and the third execution module is used for executing the control action corresponding to the optimal action value to realize the energy-saving control of the air conditioner.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5 or the steps of the method according to claim 6.
12. An air conditioning apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the method according to any one of claims 1 to 5 or the steps of the method according to claim 6.
CN201910091191.XA 2019-01-30 2019-01-30 Energy-saving control strategy learning method, and method and device for realizing air conditioning energy control Active CN111505944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910091191.XA CN111505944B (en) 2019-01-30 2019-01-30 Energy-saving control strategy learning method, and method and device for realizing air conditioning energy control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910091191.XA CN111505944B (en) 2019-01-30 2019-01-30 Energy-saving control strategy learning method, and method and device for realizing air conditioning energy control

Publications (2)

Publication Number Publication Date
CN111505944A CN111505944A (en) 2020-08-07
CN111505944B true CN111505944B (en) 2021-06-11

Family

ID=71874024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910091191.XA Active CN111505944B (en) 2019-01-30 2019-01-30 Energy-saving control strategy learning method, and method and device for realizing air conditioning energy control

Country Status (1)

Country Link
CN (1) CN111505944B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112394697A (en) * 2020-11-24 2021-02-23 中国铁路设计集团有限公司 Railway station building equipment monitoring and energy management system, program and storage medium
CN114580688A (en) * 2020-11-30 2022-06-03 中兴通讯股份有限公司 Control model optimization method of water cooling system, electronic equipment and storage medium
CN114251788B (en) * 2021-12-18 2023-03-17 珠海格力电器股份有限公司 Air conditioner energy consumption prompting method and system for rental platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101033882A (en) * 2007-04-28 2007-09-12 珠海格力电器股份有限公司 Air-conditioning unit operating according to user-defined curve and control method therefor
CN103017290A (en) * 2011-09-21 2013-04-03 珠海格力电器股份有限公司 Air conditioner electric energy control device and air conditioner electric energy management method
CN104132420A (en) * 2013-05-02 2014-11-05 珠海格力电器股份有限公司 Low-power consumption standby circuit device, air conditioner and control method of air conditioner
CN108844183A (en) * 2018-06-04 2018-11-20 珠海格力电器股份有限公司 A kind of energy-saving control method, device and household appliance

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103168278B (en) * 2010-08-06 2017-01-18 加利福尼亚大学董事会 Systems and methods for analyzing building operations sensor data
CN103248693A (en) * 2013-05-03 2013-08-14 东南大学 Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning
CN103324085B (en) * 2013-06-09 2016-03-02 中国科学院自动化研究所 Based on the method for optimally controlling of supervised intensified learning
US20150178421A1 (en) * 2013-12-20 2015-06-25 BrightBox Technologies, Inc. Systems for and methods of modeling, step-testing, and adaptively controlling in-situ building components
US10012965B2 (en) * 2013-12-27 2018-07-03 Quirky Ip Licensing Llc Window air conditioning apparatus and controller
CN104123598A (en) * 2014-08-07 2014-10-29 山东大学 Charging mode selection method based on multi-objective optimization for electric private car
US9869484B2 (en) * 2015-01-14 2018-01-16 Google Inc. Predictively controlling an environmental control system
US10465931B2 (en) * 2015-01-30 2019-11-05 Schneider Electric It Corporation Automated control and parallel learning HVAC apparatuses, methods and systems
CN104680339B (en) * 2015-03-26 2017-11-07 中国地质大学(武汉) A kind of household electrical appliance dispatching method based on Spot Price
US9482442B1 (en) * 2015-04-24 2016-11-01 Dataxu, Inc. Decision dashboard balancing competing objectives
US10839302B2 (en) * 2015-11-24 2020-11-17 The Research Foundation For The State University Of New York Approximate value iteration with complex returns by bounding
US10101050B2 (en) * 2015-12-09 2018-10-16 Google Llc Dispatch engine for optimizing demand-response thermostat events
CN107065582B (en) * 2017-03-31 2023-09-29 苏州科技大学 Indoor air intelligent adjusting system and method based on environment parameters
CN108419439B (en) * 2017-05-22 2020-06-30 深圳微自然创新科技有限公司 Household equipment learning method and server
JP6530783B2 (en) * 2017-06-12 2019-06-12 ファナック株式会社 Machine learning device, control device and machine learning program
CN107314477B (en) * 2017-07-04 2020-01-03 河南工程学院 Intelligent distribution system for refrigerating capacity of central air conditioner
CN107315572B (en) * 2017-07-19 2020-08-11 北京上格云技术有限公司 Control method of building electromechanical system, storage medium and terminal equipment
CN108088006A (en) * 2017-12-11 2018-05-29 珠海格力电器股份有限公司 Energy-saving type air conditioner and control method
CN108717873A (en) * 2018-07-20 2018-10-30 同济大学 A kind of space luminous environment AI regulating systems based on unsupervised learning technology
CN109166066A (en) * 2018-10-09 2019-01-08 河南水利与环境职业学院 A kind of Modeling Teaching of Mathematics learning system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101033882A (en) * 2007-04-28 2007-09-12 珠海格力电器股份有限公司 Air-conditioning unit operating according to user-defined curve and control method therefor
CN103017290A (en) * 2011-09-21 2013-04-03 珠海格力电器股份有限公司 Air conditioner electric energy control device and air conditioner electric energy management method
CN104132420A (en) * 2013-05-02 2014-11-05 珠海格力电器股份有限公司 Low-power consumption standby circuit device, air conditioner and control method of air conditioner
CN108844183A (en) * 2018-06-04 2018-11-20 珠海格力电器股份有限公司 A kind of energy-saving control method, device and household appliance

Also Published As

Publication number Publication date
CN111505944A (en) 2020-08-07

Similar Documents

Publication Publication Date Title
CN111505944B (en) Energy-saving control strategy learning method, and method and device for realizing air conditioning energy control
CN111010700B (en) Method and device for determining load threshold
CN111476422A (en) L ightGBM building cold load prediction method based on machine learning framework
WO2021129086A1 (en) Traffic prediction method, device, and storage medium
CN107870810B (en) Application cleaning method and device, storage medium and electronic equipment
CN113379564A (en) Power grid load prediction method and device and terminal equipment
EP3872656A1 (en) Information processing apparatus, information processing method, and program
CN112801154B (en) Behavior analysis method and system for orphan elderly people
CN110956277A (en) Interactive iterative modeling system and method
WO2019085754A1 (en) Application cleaning method and apparatus, and storage medium and electronic device
CN114492279A (en) Parameter optimization method and system for analog integrated circuit
CN111949498A (en) Application server abnormity prediction method and system
CN107943537B (en) Application cleaning method and device, storage medium and electronic equipment
CN114139604A (en) Online learning-based electric power industrial control attack monitoring method and device
CN111338227B (en) Electronic appliance control method and control device based on reinforcement learning and storage medium
CN107861769B (en) Application cleaning method and device, storage medium and electronic equipment
CN111609525A (en) Air conditioner control method and device, electronic equipment and storage medium
CN109062396B (en) Method and device for controlling equipment
CN112486683B (en) Processor control method, control apparatus, and computer-readable storage medium
CN115081515A (en) Energy efficiency evaluation model construction method and device, terminal and storage medium
CN113741402A (en) Equipment control method and device, computer equipment and storage medium
CN113988670A (en) Comprehensive enterprise credit risk early warning method and system
CN113993343A (en) Energy-saving control method and device for refrigeration equipment, terminal and storage medium
CN114679899B (en) Self-adaptive energy-saving control method and device, medium and equipment for machine room air conditioner
CN113692177B (en) Control method, device and terminal for power consumption of refrigeration system of data center

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Tan Jianming

Inventor after: Yue Dong

Inventor after: Li Shaobin

Inventor after: Song Dechao

Inventor after: Chen Li

Inventor after: Tang Jie

Inventor after: Luo Xiaoyu

Inventor after: Deng Jiabi

Inventor after: Wang Pengfei

Inventor after: Xiao Wenxuan

Inventor before: Tan Jianming

Inventor before: Li Shaobin

Inventor before: Song Dechao

Inventor before: Chen Li

Inventor before: Luo Xiaoyu

Inventor before: Deng Jiabi

Inventor before: Wang Pengfei

Inventor before: Xiao Wenxuan

Inventor before: Yue Dong
TR01 Transfer of patent right

Effective date of registration: 20221012

Address after: 519015 Room 601, Lianshan Lane, Jida Jingshan Road, Zhuhai City, Guangdong Province

Patentee after: Zhuhai Lianyun Technology Co.,Ltd.

Address before: 519070, Jinji Hill Road, front hill, Zhuhai, Guangdong

Patentee before: GREE ELECTRIC APPLIANCES Inc. OF ZHUHAI