CN116880169A - Peak power demand prediction control method based on deep reinforcement learning - Google Patents

Peak power demand prediction control method based on deep reinforcement learning

Info

Publication number
CN116880169A
Authority
CN
China
Prior art keywords
action
network
energy consumption
value
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310744068.XA
Other languages
Chinese (zh)
Inventor
Fu Qiming
Liu Lu
Ma Jie
Chen Jianping
Lu You
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University of Science and Technology
Original Assignee
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University of Science and Technology filed Critical Suzhou University of Science and Technology
Priority to CN202310744068.XA priority Critical patent/CN116880169A/en
Publication of CN116880169A publication Critical patent/CN116880169A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems according to some preassigned criterion, electric
    • G05B13/04 Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042 Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The application relates to the technical field of building energy conservation and discloses a peak power demand prediction control method based on deep reinforcement learning, comprising the following steps: acquiring data samples of four buildings over a period of time as a data set; dividing the control actions of building energy consumption into M equally sized intervals; splitting the data into a training set and a test set at an 8:2 ratio; constructing a deep forest module; constructing a first deep reinforcement learning module; predicting and classifying new energy consumption data with the trained model; constructing a second deep reinforcement learning module; and predicting the next action a_t with an Actor network. Through this iterative updating, the Agent gradually learns the optimal action strategy and achieves optimal control of the building-group peak load. The method avoids the human factors and limitations of traditional methods and offers better flexibility and adaptability.

Description

Peak power demand prediction control method based on deep reinforcement learning
Technical Field
The application relates to the technical field of building energy conservation, in particular to a peak power demand prediction control method based on deep reinforcement learning.
Background
The building sector is one of the most energy-intensive fields worldwide, and the management and optimization of building energy consumption have become urgent. Peak power demand control is an important strategy in building energy management and optimization. By implementing peak power demand control, the energy consumption of a building can be reduced, and its energy utilization efficiency and management level can be improved. However, controlling peak power demand is difficult in practice.
Conventional rule-based control strategies suffer from a number of limitations, such as being unable to accommodate complex environmental and demand changes. Although deep reinforcement learning techniques have been widely used in recent years in peak power demand prediction and control, existing research has certain drawbacks. First, the traditional deep reinforcement learning method has high calculation cost when processing the continuous state space, which results in slow algorithm convergence. Second, existing studies tend to focus only on control of peak power demand, and ignore the importance of predictions for optimizing control strategies. Therefore, innovations in prediction and control are needed to improve the efficiency and accuracy of energy consumption management and optimization.
Disclosure of Invention
The application aims to provide a peak power demand prediction control method based on deep reinforcement learning, so as to solve the problems noted in the background art: traditional rule-based control strategies have many limitations, such as the inability to adapt to complex environmental changes and demand changes. Although deep reinforcement learning techniques have been widely used in recent years for peak power demand prediction and control, existing research has certain drawbacks. First, traditional deep reinforcement learning methods have a high computational cost when processing continuous state spaces, which results in slow algorithm convergence. Second, existing studies tend to focus only on the control of peak power demand and ignore the importance of prediction for optimizing control strategies.
In order to achieve the above purpose, the present application provides the following technical solutions: a peak power demand prediction control method based on deep reinforcement learning comprises the following steps:
step one, using EnergyPlus to simulate and acquire data samples of four buildings over a period of time as a data set;
step two, dividing the control actions of building energy consumption into M equally sized intervals to obtain different action spaces; by splitting the action control space into several equidistant intervals, each discrete control action is associated with a continuous numerical range so that it can be better processed and modeled;
step three, splitting the data into a training set and a test set at an 8:2 ratio and reconstructing the energy consumption data within the training set range; classifying and labeling the energy consumption data with labels in the interval [1, M], forming new samples and labels, and normalizing them;
step four, constructing a deep forest module, taking the data set from step three as the input of the deep forest module, and training a deep forest classifier; after the classifier training is completed, the normalized samples are fed back into the classifier as original feature vectors; transformed feature vectors are obtained through multi-granularity scanning; the cascade forest structure in the deep forest takes the transformed feature vectors as input and outputs the probability of each action category corresponding to the data;
step five, constructing a first deep reinforcement learning module for predicting energy consumption data; the normalized newly constructed sample is combined with the action-interval class probabilities output by the deep forest module as the input of the Q neural network; the Q neural network calculates the Q values of all actions, and the target Q network calculates the target Q value; the TD error between the two is used to update the parameters of the Q network;
step six, predicting and classifying new energy consumption data by using the trained model, and comparing and verifying the new energy consumption data with an actual observed value to evaluate the generalization capability and the prediction precision of the model;
step seven, constructing a second deep reinforcement learning module for controlling the energy storage equipment in the building group so as to optimize peak load; at each time step t, the Agent predicts the future energy demand of the building group using the deep reinforcement learning module combined with the deep forest, and combines it with the current building status, weather and time to form a new state tuple s_t, which is input to the other deep reinforcement learning module; based on this state tuple, the Agent selects an action a_t, which influences the peak load of the whole system by controlling the energy storage equipment in the four buildings;
step eight, the Agent obtains the new state tuple s_t and predicts the next action a_t using the Actor network;
Step nine, through the iterative updating, the Agent can learn the optimal action strategy step by step, and realize the optimal control of the building group peak load.
Preferably, in the third step, the training set range data is subjected to sample and label reconstruction, proper attributes are required to be selected as characteristics, and proper first n pieces of historical energy consumption data are selected as characteristics through cross verification; then for time t, will [ E ] t-n ,E t-n-1 …,E t-1 ]As a new sample, E t For its corresponding new tag.
Preferably, in the fifth step, the algorithm minimizes the average mean square error between the Q network and the target Q network by gradient descent, so as to optimize the training effect of the model.
Preferably, in step eight, the Actor network will s t As input, output an a t Then Agent uses the probability distribution to sample an action a t The method comprises the steps of carrying out a first treatment on the surface of the Next, agent will a t As input, combine the current state s t Calculating a target Q value Q through a Critic network target (s t ,a t ) The method comprises the steps of carrying out a first treatment on the surface of the Finally, the Agent uses an Adam optimization algorithm to update the parameters of the Actor network and the Critic network so as to maximize the target Q value; in the optimization process, in order to prevent the oscillation of network parameters, the parameters of the target Actor network and the target Critic network are updated by using a soft update strategy.
Preferably, the original large prediction space is divided into N subspaces by a depth forest module, the actions in each subspace are expressed by a unified formula, the formula ingeniously utilizes the property of a general term to compress the action space, and each action in the compression space is expressed as one action in the whole subspace;
in the formula, x and z represent an upper limit and a lower limit of the action space, respectively, and N represents a final value of the compression space; by compressing the action space in this way, the size of the action space can be greatly reduced to cope with the problem of the reduction of prediction accuracy caused by the large prediction space.
Preferably, in the fifth step, modeling the energy consumption prediction problem as MDP modeling, and constructing corresponding states, actions and immediate rewards functions;
wherein:
status: denoted by s; s is(s) t Consists of the normalized sample in the third step and the probability output by the depth forest module in the fourth step, namely
The actions are as follows: a is used for representing that each action corresponds to an energy consumption predicted value;
immediate rewards function: denoted by r; at time t, a t For the predicted value of energy consumption, the absolute value of the difference value between the predicted value of energy consumption and the actual energy consumption value can be regarded as rewards obtained by the agent at the time t, and the predicted value of energy consumption and the actual energy consumption value are expressed as follows:
R 1 =|En pre -En true |。
Preferably, in step five, the parameters θ of the Q network are updated using the TD error between the Q network and the target Q network, specifically:
L(θ_i) = E[(r + γ*max_a' Q(s', a'; θ_i^-) − Q(s, a; θ_i))^2]
where (s, a, r, s') is a quadruple sampled from the experience pool, a' is the action executed by the agent at time t+1, θ_i^- and θ_i denote the parameters of the target Q network and the Q network respectively, and r is the reward obtained by executing action a_t in state s_t at time t.
Preferably, in step seven, modeling the control problem of the energy storage device in the building as MDP modeling, and constructing corresponding states, actions and immediate rewards functions;
wherein:
status: the state variable of the control system mainly consists of two parts; the first part comprises state variables of the cluster building, which are divided into time variables, area related variables and building related variables; the time variable comprises month, time and day types; the area related variables include weather information and electricity prices, including outdoor dry bulb temperature, relative humidity, direct and diffuse solar radiation, solar power generation, and predicted outdoor temperatures and humidities for 5-8 hours and 11-15 hours into the future; building related variables include indoor temperature, indoor humidity, and use of non-mobile devices; the second part comprises dynamic state variables such as coefficient of performance of the heat pump, state of charge of the hot and cold water tanks, and building energy consumption by the predicted next time step t;
the actions are as follows: the energy storage system under each building consists of two controllable units which respectively represent hot water storage tanks and cold water storage tanks; in order to ensure that the shortage of energy supply and demand does not occur, the upper and lower limits of the action space are set to be 1/3 of the maximum energy storage capacity, and the action space is expressed as { a } 11 ,a 12 ,a 21 ,a 22 ,a 31 ,a 32 ,a 41 ,a 42 };
Immediate rewards function: the bonus function of the control section should take into account the power peak regulation effect and the power cost, as they both affect the quality of the timing control of the system; the quality of the electric power peak regulation is mainly reflected on energy consumption variables in the rewarding function, and the cost judgment is based on the influence of the current price; thus, the bonus function is designed as follows:
R 2 =α*En t +β*[(En t /10) 3 ]*Pr t
wherein En is t Represents the current power demand value, which has been smoothed to increaseCalculation accuracy, pr t A current electricity price representing time t; the reward function in this equation grasps the interaction between power demand and price in order to find an intermediate value that balances peak power demand and power cost; wherein the set values of a and beta are 0.8 and 0.2, respectively.
Preferably, in the updating of the eighth Actor network, action a is taken by maximizing the current state t The Q value obtained, i.e. maxQ (s t ,a t ) Updating parameters of the Actor network; the updating process uses a gradient rising method, so that the quality of a strategy of an Actor network can be gradually improved; the method comprises the following steps:
critic network whose goal is to minimize the error of the predicted Q value from the true Q value, i.e., training parameters of Critic network by minimizing mean square error using the predicted Q value as the goal, TD goal is defined as y i =r+γQ(s',μ(s'|θ μ′ )|θ Q ) The method is characterized by comprising the following steps:
L=1/N∑ i (y i -Q(s i ,a iQ )) 2
the target Actor network and the target Critic network both adopt a soft update method to ensure the stability of the algorithm instead of directly copying network parameters, and the method is as follows:
the application has the beneficial effects that:
the method can efficiently solve the problems of building load prediction and energy storage equipment control. According to the method, future load conditions can be predicted through learning building load data, and an optimal control strategy is generated according to the real-time state of the energy storage equipment and the predicted conditions of the building load. Compared with the traditional rule and experience-based control method, the method does not need to manually design a control strategy, and can automatically learn an optimal strategy according to data, so that human factors and limitations in the traditional method are avoided, and the method has better flexibility and adaptability. Meanwhile, the method can effectively reduce the energy consumption cost of the building, improves the energy utilization efficiency, and plays an important role in promoting the realization of the aim of intelligent energy.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of the overall architecture of the present application;
FIG. 2 is an enlarged view of the data set generation architecture of FIG. 1 in accordance with the present application;
FIG. 3 is an enlarged view of the prediction phase architecture of FIG. 1 according to the present application;
FIG. 4 is an enlarged view of the Control phase architecture of FIG. 1 according to the present application;
FIG. 5 is an enlarged view of the data preprocessing architecture of the prediction stage of FIG. 2 according to the present application;
FIG. 6 is an enlarged view of the deep forest classifier architecture of the prediction stage of FIG. 2 of the present application;
FIG. 7 is an enlarged view of the DQN-based energy consumption prediction architecture of the prediction stage of FIG. 2 according to the present application.
Detailed Description
In order that the above objects, features and advantages of the application will be readily understood, a more particular description of the application will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit of the application, whereby the application is not limited to the specific embodiments disclosed below.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Examples
As shown in fig. 1-7, an embodiment of the present application discloses a peak power demand prediction control method based on deep reinforcement learning, which includes the following steps:
Step one, with reference to the commercial reference buildings developed by the United States Department of Energy (DOE), using EnergyPlus to simulate and acquire data samples of four buildings over a period of time as a data set;
Step two, dividing the control actions of building energy consumption into M equally sized intervals to obtain different action spaces; by splitting the action control space into several equidistant intervals, each discrete control action is associated with a continuous numerical range so that it can be better processed and modeled;
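As an illustration of this discretization, the following Python sketch splits an assumed continuous control range into M equal intervals and maps a continuous value to its interval label in [1, M]; the range bounds, M and the helper names are placeholder assumptions, not values taken from the application.

```python
# Hypothetical sketch of step two: splitting the energy-consumption control
# range into M equal-width intervals. M, action_low and action_high are
# illustrative assumptions.
import numpy as np

M = 20                                  # number of equal-size action intervals (assumed)
action_low, action_high = 0.0, 100.0    # assumed control range

# Interval edges and the representative (midpoint) value of each interval.
edges = np.linspace(action_low, action_high, M + 1)
midpoints = (edges[:-1] + edges[1:]) / 2.0

def action_to_interval(value: float) -> int:
    """Map a continuous control value to its interval label in [1, M]."""
    idx = int(np.clip(np.digitize(value, edges[1:-1]), 0, M - 1))
    return idx + 1  # labels are 1-based, matching the [1, M] label range
```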
Step three, splitting the data into a training set and a test set at an 8:2 ratio and reconstructing the energy consumption data within the training set range. Classifying and labeling the energy consumption data with labels in the interval [1, M], forming new samples and labels, and normalizing them;
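For concreteness, the sketch below shows one way to build the [E_{t-n}, ..., E_{t-1}] → E_t samples and apply min-max normalization after an 8:2 split; the array names, the window length n = 6 and the random placeholder data are assumptions for illustration only.

```python
# A minimal sketch of step three, assuming a 1-D array of energy readings and
# a window length n chosen by cross-validation (n = 6 is a placeholder).
import numpy as np

def build_samples(energy: np.ndarray, n: int):
    """Use [E_{t-n}, ..., E_{t-1}] as a sample and E_t as its label."""
    X = np.stack([energy[i - n:i] for i in range(n, len(energy))])
    y = energy[n:]
    return X, y

def minmax_normalize(a: np.ndarray):
    lo, hi = a.min(), a.max()
    return (a - lo) / (hi - lo + 1e-8), (lo, hi)

energy = np.random.rand(1000)          # placeholder for simulated EnergyPlus data
split = int(0.8 * len(energy))         # 8:2 train/test split
train, test = energy[:split], energy[split:]
X_train, y_train = build_samples(train, n=6)
X_train, _ = minmax_normalize(X_train)
y_train, _ = minmax_normalize(y_train)
```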
Step four, constructing a deep forest module, taking the data set from step three as the input of the deep forest module, and training a deep forest classifier. After the classifier training is completed, the normalized samples are fed back into the classifier as original feature vectors. Transformed feature vectors are obtained through multi-granularity scanning. The cascade forest structure in the deep forest takes the transformed feature vectors as input and outputs the probability of each action category corresponding to the data;
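The deep forest classifier itself is not specified line by line in the application; the following simplified cascade-forest sketch (with multi-granularity scanning omitted) only illustrates how stacked random/extra-trees forests can augment the original feature vector with class-probability vectors and output per-class probabilities. The class name and hyperparameters are assumptions.

```python
# A simplified stand-in for the deep forest (cascade forest) classifier of
# step four; not the application's exact implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

class MiniCascadeForest:
    def __init__(self, n_layers=2, n_estimators=100):
        self.layers = [
            [RandomForestClassifier(n_estimators=n_estimators),
             ExtraTreesClassifier(n_estimators=n_estimators)]
            for _ in range(n_layers)
        ]

    def fit(self, X, y):
        feats = X
        for layer in self.layers:
            probs = []
            for forest in layer:
                forest.fit(feats, y)
                probs.append(forest.predict_proba(feats))
            # Augment the original features with each forest's class-probability vector.
            feats = np.hstack([X] + probs)
        return self

    def predict_proba(self, X):
        feats = X
        for layer in self.layers:
            probs = [forest.predict_proba(feats) for forest in layer]
            feats = np.hstack([X] + probs)
        # Average the last layer's class-probability vectors.
        return np.mean(probs, axis=0)
```

In the context of the method above, X would be the normalized windows from step three, y the interval labels in [1, M], and the probabilities returned by predict_proba are what gets concatenated to the state in step five.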
Step five, constructing a first deep reinforcement learning module for predicting energy consumption data; this module adopts the DQN architecture and combines the normalized newly constructed sample with the action-interval class probabilities output by the deep forest module as the input of the Q neural network. The Q neural network calculates the Q values of all actions, and the target Q network calculates the target Q value. The TD error between the two is used to update the parameters of the Q network. Specifically, the algorithm minimizes the average mean square error between the Q network and the target Q network by gradient descent, thereby optimizing the training of the model, expressed as:
L(θ_i) = E[(r + γ*max_a' Q(s', a'; θ_i^-) − Q(s, a; θ_i))^2]
where (s, a, r, s') is a quadruple sampled from the experience pool, a' is the action executed by the agent at time t+1, θ_i^- and θ_i denote the parameters of the target Q network and the Q network respectively, and r is the reward obtained by executing action a_t in state s_t at time t. Specifically, this part models the energy consumption prediction problem as an MDP and constructs the corresponding state, action and immediate reward functions.
Wherein:
(a) State: denoted by s. s_t consists of the normalized sample from step three and the probability vector output by the deep forest module in step four, namely the concatenation of the two.
(b) Action: denoted by a; each action corresponds to an energy consumption prediction.
(c) Immediate reward function: denoted by r. At time t, a_t is the energy consumption prediction, and the absolute value of the difference between the predicted and actual energy consumption values is regarded as the reward obtained by the agent at time t, expressed as:
R_1 = |En_pre − En_true|
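A hedged PyTorch sketch of the step-five update is given below: the state concatenates the normalized window with the deep-forest class probabilities, the Q network scores all discrete actions, and one gradient step minimizes the mean-squared TD error against the target Q network. The layer sizes, γ and the optimizer settings are illustrative assumptions, not values from the application.

```python
# Sketch of the DQN prediction module (step five); sizes are placeholders.
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a state (normalized window + class probabilities) to one Q value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the mean-squared TD error between Q and the target Q network."""
    s, a, r, s_next = batch                       # tensors sampled from the experience pool
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)   # average mean square error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical wiring (dimensions assumed): the target network starts as a copy of the
# Q network and is periodically synchronized during training.
state_dim, n_actions = 6 + 20, 20
q_net, target_net = QNet(state_dim, n_actions), QNet(state_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```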
step six, predicting and classifying new energy consumption data by using the trained model, and comparing and verifying the new energy consumption data with an actual observed value to evaluate the generalization capability and the prediction precision of the model;
Step seven, constructing a second deep reinforcement learning module for controlling the energy storage equipment in the building group so as to optimize peak load. At each time step t, the Agent predicts the future energy demand of the building group using the deep reinforcement learning module combined with the deep forest, and combines it with the current building state, weather and time to form a new state tuple s_t, which is input to the other deep reinforcement learning module. Based on this state tuple, the Agent selects an action a_t, which affects the peak load of the overall system by controlling the energy storage devices (hot and cold water storage tanks) in the four buildings. Specifically, this part models the control problem of the energy storage devices in the buildings as an MDP and constructs the corresponding state, action and immediate reward functions.
Wherein:
(a) State: the state variables of the control system consist mainly of two parts. The first part includes the state variables of the clustered buildings, divided into time, area-related and building-related variables. The time variables comprise the month, the hour and the day type; the area-related variables include weather information and electricity prices, covering outdoor dry bulb temperature, relative humidity, direct and diffuse solar radiation, solar power generation, and predicted outdoor temperatures and humidities for 5-8 hours and 11-15 hours into the future, with the 6-hour and 12-hour outdoor temperatures and humidities being preferred; the building-related variables include indoor temperature, indoor humidity and the use of non-mobile devices. The second part includes dynamic state variables such as the coefficient of performance (COP) of the heat pump, the state of charge (SOC) of the hot and cold water tanks, and the predicted building energy consumption for the next time step t.
(b) Action: the energy storage system of each building consists of two controllable units, representing the hot water storage tank and the cold water storage tank respectively. To ensure that no shortage between energy supply and demand occurs, the upper and lower limits of the action space are set to 1/3 of the maximum energy storage capacity, and the action space is expressed as {a_11, a_12, a_21, a_22, a_31, a_32, a_41, a_42}.
(c) Immediate reward function: the reward function of the control part should take into account both the power peak-regulation effect and the power cost, as both affect the quality of the system's timing control. The quality of power peak regulation is mainly reflected in the energy consumption variable of the reward function, while the cost judgment is based on the influence of the current price. Thus, the reward function is designed as follows:
R_2 = α*En_t + β*(En_t/10)^3*Pr_t
where En_t represents the current power demand value, which is smoothed to improve calculation accuracy, and Pr_t represents the current electricity price at time t. The reward function captures the interaction between power demand and price in order to find an intermediate value that balances peak power demand and power cost. The values of α and β are set to 0.8 and 0.2, respectively.
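The control reward can be transcribed directly; the helper below evaluates R_2 = α*En_t + β*(En_t/10)^3*Pr_t with the stated α = 0.8 and β = 0.2 (the function name and argument names are illustrative).

```python
# Direct transcription of the step-seven reward; en_t is the smoothed current
# power demand and pr_t the electricity price at time t.
def control_reward(en_t: float, pr_t: float, alpha: float = 0.8, beta: float = 0.2) -> float:
    return alpha * en_t + beta * (en_t / 10.0) ** 3 * pr_t
```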
Step eight, the control module adopts a DDPG architecture, and the Agent obtains a new state tuple s t Predicting next action a using an Actor network t . Specifically, the Actor network will s t As input, output an a t Then Agent uses the probability distribution to sample an action a t . Next, agent will a t As input, combine the current state s t Calculating a target Q value Q through a Critic network target (s t ,a t ). Finally, the Agent uses Adam optimization algorithm to update the parameters of the Actor network and the Critic network to maximize the target Q value. Action a is taken in the update of the Actor network by maximizing the current state t The Q value obtained, i.e. maxQ (s t ,a t ) To update the parameters of the Actor network. This update procedure uses a gradient-increasing approach, enabling the Actor network to gradually improve the quality of its policies. The method comprises the following steps:
critic networks are similar to the Q networks in step five, with the goal of minimizing the error of the predicted Q value from the true Q valueI.e. training parameters of Critic networks by minimizing mean square error using predicted Q values as targets, TD targets are defined as y i =r+γQ(s',μ(s'|θ μ′ )|θ Q ) The method is characterized by comprising the following steps:
L=1/N∑ i (y i -Q(s i ,a iQ )) 2
the target Actor network and the target Critic network both adopt a soft update method to ensure the stability of the algorithm instead of directly copying network parameters, and the method is as follows:
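The following sketch summarizes the step-eight updates under standard DDPG assumptions: the Critic is regressed onto the TD target built from the target networks, the Actor is updated by gradient ascent on the Critic's Q value, and both target networks are soft updated. τ, γ, the layer sizes and the network interfaces (critic(s, a), actor(s)) are assumptions rather than values specified in the application.

```python
# Hedged sketch of the DDPG control module (step eight); sizes are placeholders.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s) -> a, tanh-bounded."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, a) network over the concatenated state-action pair."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch
    # Critic: regress Q(s, a) onto y = r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next)).squeeze(-1)
    critic_loss = nn.functional.mse_loss(critic(s, a).squeeze(-1), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: gradient ascent on Q(s, mu(s)), implemented as minimizing -Q.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks: theta' <- tau*theta + (1 - tau)*theta'.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```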
step nine, through the iterative updating, the Agent can learn the optimal action strategy step by step, and realize the optimal control of the building group peak load.
Specifically, the specific algorithm flow of the whole predictive control method is as follows:
S1, initializing the number of state classes M, corresponding to the number of sample classes;
S2, initializing experience pool D1 and experience pool D2;
S3, initializing the Q function Q_1 and the target Q function Q'_1;
S4, initializing the Actor network and the target Actor network, with Actor network parameters θ_a and target Actor network parameters θ_a'; initializing the Critic network and the target Critic network, with Critic network parameters θ_c and target Critic network parameters θ_c';
S5, dividing the data set, reconstructing the data within the training set range to form new samples and label values, and normalizing the data;
S6, training the deep forest classifier;
S7, setting the number of training episodes and entering training; for each episode, carrying out steps S8 to S17;
S8, randomly selecting a data sample, classifying it with the trained deep forest classifier, outputting the class probabilities, and constructing a new state s_1t from the class probabilities and the original sample;
S9, using the current state s_1t, calculating the Q values of all actions according to the action-value function Q_1 and selecting action a_1 with an ε-greedy policy;
S10, executing action a_1, observing the new state s_1t+1 and the immediate reward R_1, and storing the experience (s_1t, a_1, R_1, s_1t+1) in replay memory D1;
S11, randomly extracting a batch of experiences from experience pool D1 and updating the action-value function Q_1 with the experience data;
S12, updating the target function Q'_1 once every n steps;
S13, if the prediction accuracy reaches a preset percentage, constructing a new state s_2t from the predicted value;
S14, the Actor network selecting action a_2 based on the new state s_2t using an ε-greedy policy;
S15, executing action a_2, observing the new state s_2t+1 and the immediate reward R_2, and storing the experience (s_2t, a_2, R_2, s_2t+1) in experience pool D2;
S16, randomly sampling a batch of experiences from experience pool D2 and using them to update the Critic network parameters θ_c and the Actor network parameters θ_a;
S17, every certain number of steps, updating the target Critic network parameters θ_c' and the target Actor network parameters θ_a' with a soft update.
Those skilled in the art will appreciate that the features recited in the various embodiments of the application and/or in the claims may be combined in various ways, even if such combinations are not explicitly recited in the application. In particular, such combinations may be made without departing from the spirit and teachings of the application, and all of them fall within its scope.
The foregoing description of the embodiments illustrates the general principles of the application and is not intended to limit it; any modifications, equivalents and improvements made without departing from the spirit and principles of the application shall fall within its scope of protection.

Claims (9)

1. A peak power demand prediction control method based on deep reinforcement learning is characterized in that: the method comprises the following steps:
step one, using EnergyPlus to simulate and acquire data samples of four buildings over a period of time as a data set;
step two, dividing the control actions of building energy consumption into M equally sized intervals to obtain different action spaces; by splitting the action control space into several equidistant intervals, each discrete control action is associated with a continuous numerical range so that it can be better processed and modeled;
step three, splitting the data into a training set and a test set at an 8:2 ratio and reconstructing the energy consumption data within the training set range; classifying and labeling the energy consumption data with labels in the interval [1, M], forming new samples and labels, and normalizing them;
step four, constructing a deep forest module, taking the data set from step three as the input of the deep forest module, and training a deep forest classifier; after the classifier training is completed, the normalized samples are fed back into the classifier as original feature vectors; transformed feature vectors are obtained through multi-granularity scanning; the cascade forest structure in the deep forest takes the transformed feature vectors as input and outputs the probability of each action category corresponding to the data;
step five, constructing a first deep reinforcement learning module for predicting energy consumption data; the normalized newly constructed sample is combined with the action-interval class probabilities output by the deep forest module as the input of the Q neural network; the Q neural network calculates the Q values of all actions, and the target Q network calculates the target Q value; the TD error between the two is used to update the parameters of the Q network;
step six, predicting and classifying new energy consumption data by using the trained model, and comparing and verifying the new energy consumption data with an actual observed value to evaluate the generalization capability and the prediction precision of the model;
step seven, constructing a second deep reinforcement learning module for controlling the energy storage equipment in the building group so as to optimize peak load; at each time step t, the Agent predicts the future energy demand of the building group using the deep reinforcement learning module combined with the deep forest, and combines it with the current building state, weather and time to form a new state tuple s_t, which is input to the other deep reinforcement learning module; based on this state tuple, the Agent selects an action a_t, which influences the peak load of the whole system by controlling the energy storage equipment in the four buildings;
step eight, the Agent obtains the new state tuple s_t and predicts the next action a_t using the Actor network;
Step nine, through the iterative updating, the Agent can learn the optimal action strategy step by step, and realize the optimal control of the building group peak load.
2. The method for predictive control of peak power demand based on deep reinforcement learning of claim 1, wherein: in step three, samples and labels are reconstructed from the training set range data; suitable attributes need to be selected as features, and the appropriate number n of historical energy consumption records is selected by cross-validation; then, for time t, [E_{t-n}, ..., E_{t-1}] is taken as a new sample and E_t as its corresponding new label.
3. The method for predictive control of peak power demand based on deep reinforcement learning of claim 1, wherein: in the fifth step, the algorithm minimizes the average mean square error between the Q network and the target Q network through gradient descent, so that the training effect of the model is optimized.
4. The method for predictive control of peak power demand based on deep reinforcement learning of claim 1, wherein: in step eight, the Actor network takes s_t as input and outputs a_t, and the Agent then samples an action a_t from this probability distribution; next, the Agent takes a_t as input and, combined with the current state s_t, computes a target Q value Q_target(s_t, a_t) through the Critic network; finally, the Agent uses the Adam optimization algorithm to update the parameters of the Actor network and the Critic network so as to maximize the target Q value; during optimization, to prevent oscillation of the network parameters, the parameters of the target Actor network and the target Critic network are updated with a soft update strategy.
5. The method for predictive control of peak power demand based on deep reinforcement learning of claim 3, wherein: the original large prediction space is divided into N subspaces by the deep forest module, and the actions in each subspace are expressed by a unified formula; the formula uses the general-term property to compress the action space, so that each action in the compressed space represents one action of the whole subspace;
in the formula, x and z represent the upper and lower limits of the action space respectively, and N represents the final value of the compressed space; by compressing the action space in this way, its size can be greatly reduced, which counters the loss of prediction accuracy caused by an overly large prediction space.
6. The method for predictive control of peak power demand based on deep reinforcement learning of claim 5, wherein: the energy consumption prediction problem is modeled as an MDP, and the corresponding state, action and immediate reward functions are constructed;
wherein:
state: denoted by s; s_t consists of the normalized sample from step three and the probability vector output by the deep forest module in step four, namely the concatenation of the two;
action: denoted by a; each action corresponds to an energy consumption prediction;
immediate reward function: denoted by r; at time t, a_t is the energy consumption prediction, and the absolute value of the difference between the predicted and actual energy consumption values is regarded as the reward obtained by the agent at time t, expressed as:
R_1 = |En_pre − En_true|.
7. The method for predictive control of peak power demand based on deep reinforcement learning of claim 6, wherein: in step five, the parameters θ of the Q network are updated using the TD error between the Q network and the target Q network, specifically:
L(θ_i) = E[(r + γ*max_a' Q(s', a'; θ_i^-) − Q(s, a; θ_i))^2]
where (s, a, r, s') is a quadruple sampled from the experience pool, a' is the action executed by the agent at time t+1, θ_i^- and θ_i denote the parameters of the target Q network and the Q network respectively, and r is the reward obtained by executing action a_t in state s_t at time t.
8. The method for predictive control of peak power demand based on deep reinforcement learning of claim 1, wherein: the control problem of the energy storage devices in the buildings is modeled as an MDP, and the corresponding state, action and immediate reward functions are constructed;
wherein:
state: the state variables of the control system consist mainly of two parts; the first part comprises the state variables of the clustered buildings, divided into time variables, area-related variables and building-related variables; the time variables comprise the month, the hour and the day type; the area-related variables include weather information and electricity prices, covering outdoor dry bulb temperature, relative humidity, direct and diffuse solar radiation, solar power generation, and predicted outdoor temperatures and humidities for 5-8 hours and 11-15 hours into the future; the building-related variables include indoor temperature, indoor humidity and the use of non-mobile devices; the second part comprises dynamic state variables such as the coefficient of performance of the heat pump, the state of charge of the hot and cold water tanks, and the predicted building energy consumption for the next time step t;
action: the energy storage system of each building consists of two controllable units, representing the hot water storage tank and the cold water storage tank respectively; to ensure that no shortage between energy supply and demand occurs, the upper and lower limits of the action space are set to 1/3 of the maximum energy storage capacity, and the action space is expressed as {a_11, a_12, a_21, a_22, a_31, a_32, a_41, a_42};
immediate reward function: the reward function of the control part should take into account both the power peak-regulation effect and the power cost, as both affect the quality of the system's timing control; the quality of power peak regulation is mainly reflected in the energy consumption variable of the reward function, while the cost judgment is based on the influence of the current price; thus, the reward function is designed as follows:
R_2 = α*En_t + β*(En_t/10)^3*Pr_t
where En_t represents the current power demand value, which is smoothed to improve calculation accuracy, and Pr_t represents the current electricity price at time t; the reward function captures the interaction between power demand and price in order to find an intermediate value that balances peak power demand and power cost; the values of α and β are set to 0.8 and 0.2, respectively.
9. The method for predictive control of peak power demand based on deep reinforcement learning of claim 4, wherein: in the step-eight Actor network update, the parameters of the Actor network are updated by maximizing the Q value obtained by taking action a_t in the current state, i.e. max Q(s_t, a_t); the update uses gradient ascent, so that the Actor network can gradually improve the quality of its policy; specifically, the policy gradient is
∇_θ^μ J ≈ (1/N)*Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} * ∇_θ^μ μ(s|θ^μ)|_{s=s_i};
for the Critic network, the goal is to minimize the error between the predicted Q value and the true Q value, i.e. the Critic network parameters are trained by minimizing the mean square error with the predicted Q value as the target; the TD target is defined as y_i = r + γ*Q(s', μ(s'|θ^μ')|θ^Q'), and the loss is
L = (1/N)*Σ_i (y_i − Q(s_i, a_i|θ^Q))^2;
the target Actor network and the target Critic network both use soft updates rather than directly copying the network parameters, which ensures the stability of the algorithm:
θ^Q' ← τ*θ^Q + (1 − τ)*θ^Q',  θ^μ' ← τ*θ^μ + (1 − τ)*θ^μ'.
CN202310744068.XA 2023-06-25 2023-06-25 Peak power demand prediction control method based on deep reinforcement learning Pending CN116880169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310744068.XA CN116880169A (en) 2023-06-25 2023-06-25 Peak power demand prediction control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310744068.XA CN116880169A (en) 2023-06-25 2023-06-25 Peak power demand prediction control method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116880169A true CN116880169A (en) 2023-10-13

Family

ID=88263433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310744068.XA Pending CN116880169A (en) 2023-06-25 2023-06-25 Peak power demand prediction control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116880169A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688846A (en) * 2024-02-02 2024-03-12 杭州经纬信息技术股份有限公司 Reinforced learning prediction method and system for building energy consumption and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination