CN111367172B - Hybrid system energy management strategy based on reverse deep reinforcement learning


Info

Publication number
CN111367172B
Authority
CN
China
Prior art keywords
neural network
action
network
evaluation
state
Prior art date
Legal status
Active
Application number
CN202010131644.XA
Other languages
Chinese (zh)
Other versions
CN111367172A (en)
Inventor
李梓棋
赵克刚
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010131644.XA
Publication of CN111367172A
Application granted
Publication of CN111367172B


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B13/042 Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The invention discloses a hybrid power system energy management strategy based on reverse deep reinforcement learning. The strategy comprises the following steps: calculating a globally optimized SOC result with an optimization solving method and using it as expert knowledge; creating a reward neural network; obtaining the parameters of the reward neural network from the expert knowledge by reverse reinforcement learning; creating an action neural network and an evaluation neural network; setting the SOC value before vehicle interaction; inputting the acquired pre-interaction SOC value into the reward neural network to obtain a reward value; inputting the pre-interaction SOC value into the action neural network to obtain a mode distribution ratio; interacting with the environment using the mode distribution ratio to obtain the post-interaction SOC value; inputting the pre-interaction SOC value, the mode distribution ratio, the reward value and the post-interaction SOC value into the evaluation neural network to obtain an evaluation value; and having the agent calculate the gradient of each network and back-propagate to update the network parameters until training is finished. The invention can learn the optimal reward function from expert knowledge, so that deep reinforcement learning performs better.

Description

Hybrid system energy management strategy based on reverse deep reinforcement learning
Technical Field
The invention relates to the field of hybrid system energy management, in particular to a hybrid system energy management strategy based on reverse deep reinforcement learning.
Background
The hybrid electric power coupling device in a Hybrid Electric Vehicle (HEV) couples the power of several power sources, such as an internal combustion engine and an electric motor, distributes that power reasonably, and transmits it to the drive axle to propel the vehicle. It can be regarded as a complex system combining mechanical, electrical, chemical and thermodynamic considerations, formed by integrating one or more electric motors into a transmission to obtain an automatic transmission system with electric motors. China's vehicle fuel-consumption regulations place high demands on the energy saving and emission reduction of vehicle manufacturers over the next 10 to 15 years. As the pressure for energy saving and emission reduction further increases, more vehicle types (such as medium and large SUVs and MPVs) adopt hybrid power schemes, which raises higher comprehensive requirements on dynamic performance, energy consumption, pure-electric driving capability and so on. Through optimized arrangement of the mode execution sequence, the energy-saving and emission-reduction potential of a hybrid electric vehicle can be exploited to a greater extent, breaking traditional limitations to further improve the dynamic performance, energy consumption and pure-electric-drive performance of the whole vehicle. A global-optimization energy management strategy can explore the energy-saving and emission-reduction potential of a hybrid powertrain configuration well. However, a global-optimization energy management strategy can only solve for known working conditions and cannot be applied online, while traditional reinforcement learning methods are good at tasks with limited state and action spaces but are powerless when the state and action spaces are high-dimensional (Chen Xiliang et al., A survey of deep inverse reinforcement learning [J]. Computer Engineering and Applications, 2018, 54(05): 24-35.).
Disclosure of Invention
In order to accelerate the solving speed of the global optimization energy management strategy, the invention provides a hybrid system energy management strategy based on reverse deep reinforcement learning.
The purpose of the invention is realized by at least one of the following technical solutions.
A hybrid system energy management strategy based on reverse deep reinforcement learning comprises the following steps:
s1: calculating a global hybrid mode distribution ratio and a global optimized SOC result by using an optimization solving method under one complete working condition, and forming an expert state-action pair as expert knowledge of reverse reinforcement learning;
s2: creating a reward function neural network and initializing parameters;
s3: learning to obtain parameters of the neural network of the reward function by utilizing reverse reinforcement learning;
S4: establishing an action neural network and an evaluation neural network, and initializing the parameters of each network; the action neural network comprises an execution network and a target network which are identical in structure; the evaluation neural network likewise comprises an execution network and a target network which are identical in structure;
s5: under one random working condition, acquiring an SOC value s before vehicle interaction;
s6: inputting the acquired SOC value s before interaction into a reward function neural network to obtain a reward value r;
S7: inputting the acquired pre-interaction SOC value s into the action neural network, which outputs a hybrid mode distribution ratio a after processing by a plurality of hidden layers;
s8: controlling the vehicle to interact with the environment by using the hybrid mode distribution ratio a obtained in the step S7, and acquiring an SOC value S' after interaction;
s9: combining the SOC value s before interaction, the hybrid mode distribution ratio a, the reward value r and the SOC value s 'after interaction to obtain an experience vector (s, a, r, s'), and then storing the experience vector in a memory buffer;
s10: when the number of the experience vectors in the memory buffer reaches the maximum capacity, randomly extracting a set number of experience vectors from the memory buffer as the input of an evaluation neural network, and then outputting an evaluation value by the evaluation neural network according to a Bellman equation through the processing of a plurality of hidden layers;
S11: the reward function neural network, the action neural network and the evaluation neural network calculate their respective weight gradients, and the parameters of the reward function neural network, of the execution network of the action neural network and of the execution network of the evaluation neural network are then updated through back propagation;
S12: updating the parameters of the target network of the action neural network and the parameters of the target network of the evaluation neural network from the corresponding execution networks by using the soft replacement rule, thereby completing the parameter update of the action neural network and the evaluation neural network;
s13: after the parameters of the action neural network and the evaluation neural network are updated, repeating the steps S5-S12 until the set maximum step number is reached or the set convergence target is reached, finishing the training of the reward function neural network, the action neural network and the evaluation neural network, and storing the parameters of the neural network after the training is finished;
S14: controlling the controlled object with the trained reward function neural network, action neural network and evaluation neural network: the saved network parameters are first read, the SOC value s before interaction is then acquired and input into the trained action neural network, and the network outputs the hybrid power system distribution ratio a as the control quantity for controlling the vehicle.
Further, in step S1, the optimization solving method includes the pseudospectral method, the dynamic programming method and the genetic algorithm.
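As an illustration of how such expert knowledge might be produced, the following Python sketch runs a coarse dynamic-programming pass over a discretized SOC grid. It is a minimal sketch only: the drive cycle, battery capacity and fuel-cost model are placeholder assumptions rather than the patent's actual vehicle model.

```python
import numpy as np

# Placeholder drive cycle: power demand (kW) at each time step of one complete working condition
cycle_power = np.abs(np.random.randn(300)) * 20.0
soc_grid = np.linspace(0.3, 0.9, 41)      # discretized SOC states
ratios = np.linspace(0.0, 1.0, 11)        # candidate hybrid mode distribution ratios

def next_soc(soc, ratio, p_kw, dt=1.0, cap_kwh=10.0):
    """Toy battery model: the electric path supplies ratio * p_kw for dt seconds."""
    return float(np.clip(soc - ratio * p_kw * dt / 3600.0 / cap_kwh, 0.0, 1.0))

def fuel_cost(ratio, p_kw):
    """Placeholder engine fuel cost for the power not covered by the battery."""
    return (1.0 - ratio) * p_kw * 1e-3

T = len(cycle_power)
V = np.zeros((T + 1, len(soc_grid)))                # cost-to-go table
best = np.zeros((T, len(soc_grid)), dtype=int)      # index of the best ratio per (t, SOC)

for t in range(T - 1, -1, -1):                      # backward dynamic programming
    for j, soc in enumerate(soc_grid):
        costs = [fuel_cost(r, cycle_power[t])
                 + V[t + 1, np.abs(soc_grid - next_soc(soc, r, cycle_power[t])).argmin()]
                 for r in ratios]
        best[t, j] = int(np.argmin(costs))
        V[t, j] = costs[best[t, j]]

# Forward rollout of the optimal policy: collect expert (SOC, ratio) state-action pairs
expert_pairs, soc = [], 0.6
for t in range(T):
    r = ratios[best[t, np.abs(soc_grid - soc).argmin()]]
    expert_pairs.append((soc, r))
    soc = next_soc(soc, r, cycle_power[t])
```

The resulting expert_pairs list plays the role of the expert state-action pairs used as expert knowledge in the following steps.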
Further, in step S2, the reward function neural network is composed of a convolutional neural network, a long short-term memory neural network and a fully connected neural network stacked in that order; the fully connected neural network is formed by stacking a plurality of fully connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long short-term memory neural network is formed by stacking a plurality of long short-term memory layers.
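A minimal PyTorch sketch of a reward function network stacked in that order (convolutional layers, then long short-term memory layers, then fully connected layers) is given below; the layer sizes and the use of a short SOC history as the single-channel input are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Reward-function network stacked as conv -> LSTM -> fully connected."""
    def __init__(self, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(                 # convolutional block
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Sequential(                   # fully connected block
            nn.Linear(hidden, 32), nn.ReLU(),
            nn.Linear(32, 1),                      # scalar reward value r
        )

    def forward(self, soc_seq):
        # soc_seq: (batch, seq_len) history of SOC values
        x = self.conv(soc_seq.unsqueeze(1))        # (batch, 32, seq_len)
        x, _ = self.lstm(x.transpose(1, 2))        # (batch, seq_len, hidden)
        return self.fc(x[:, -1])                   # reward from the last time step

reward_net = RewardNet()
r = reward_net(torch.rand(8, 16))                  # batch of 8 SOC histories of length 16
```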
Further, in step S3, the reverse reinforcement learning method specifically includes the following steps:
S3.1, randomly generating a policy π and running it to obtain a sequence of state-action pairs ζ = {(s_t, a_t)}, where s_t denotes the t-th state in the state-action pairs, a_t denotes the t-th action, and t denotes the index of the state-action pair; the action value Q^π is then calculated by the evaluation neural network;
S3.2, using the expert state-action pairs obtained in step S1, calculating the expert action value according to the formula

Q_E^π(S_t0, A_t0) = θ^T · μ^π

where θ^T denotes the (transposed) parameter vector of the reward function network, μ^π denotes the discounted feature expectation of the policy π (the mathematical expectation E of the features accumulated with discount rate γ^t), S_t0 denotes the t0-th state of the expert state-action pairs, and A_t0 denotes the t0-th expert action, i.e. the action actually performed according to the expert knowledge at step t0;
S3.3, performing a gradient-descent update of the parameter θ with the objective function

min_θ Σ_{t=1}^{L} Σ_{i=1}^{N} [ Q^π(s_t^i, a_t^i) + L_p(s_t^i, a_t^i) - Q_E^π(s_t^i, a_t^i) ] + λ_1·‖θ‖

where s_t^i denotes the state at the t-th term of the first summation and the i-th term of the second summation, a_t^i denotes the corresponding action, L denotes the total number of terms of the first summation, N denotes the total number of terms of the second summation, λ_1 is an empirical constant used to balance the penalty term against the expectation, ε is a manually set precision threshold, and i is the counting index of the second summation; if the learned state-action pairs are consistent with the expert strategy, the loss function L_p = 0; otherwise L_p = ε.
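Because the objective above is reproduced from the patent's formula images only in outline, the following sketch should be read as one plausible implementation of the reverse-reinforcement-learning update rather than the patent's exact procedure; it reuses reward_net from the previous sketch, and the margin ε, the weight λ1 and the trajectory format are assumptions.

```python
import torch

gamma, lambda_1, epsilon = 0.99, 0.1, 1e-3
reward_opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def discounted_value(soc_histories):
    """Discounted sum of predicted rewards along one trajectory.

    soc_histories: (steps, seq_len) tensor of SOC histories, one row per step.
    The value is computed through reward_net so gradients reach its parameters theta.
    """
    rewards = reward_net(soc_histories).squeeze(-1)             # (steps,)
    discounts = gamma ** torch.arange(len(rewards), dtype=rewards.dtype)
    return (discounts * rewards).sum()

def irl_update(expert_traj, learned_traj, matched):
    """One gradient-descent step on theta; `matched` is True when the learned
    state-action pairs are consistent with the expert strategy (margin then vanishes)."""
    q_expert = discounted_value(expert_traj)
    q_learned = discounted_value(learned_traj)
    margin = 0.0 if matched else epsilon
    # lambda_1-weighted L2 penalty stands in for the regularization term on theta
    loss = q_learned + margin - q_expert + lambda_1 * sum(
        p.pow(2).sum() for p in reward_net.parameters())
    reward_opt.zero_grad()
    loss.backward()
    reward_opt.step()
    return float(loss)
```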
Further, in step S4, the action neural network and the evaluation neural network are each composed of a convolutional neural network, a long short-term memory neural network and a fully connected neural network stacked in that order; the fully connected neural network is formed by stacking a plurality of fully connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long short-term memory neural network is formed by stacking a plurality of long short-term memory layers.
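For concreteness, one way of setting up the action (actor) network, the evaluation (critic) network and their structurally identical target networks is sketched below; for brevity each body uses a single convolutional layer and a single LSTM layer, and all sizes are assumptions.

```python
import copy
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    """Action network: SOC history in, hybrid mode distribution ratio a in [0, 1] out."""
    def __init__(self, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(1, 16, 3, padding=1), nn.ReLU())
        self.lstm = nn.LSTM(16, hidden, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, 32), nn.ReLU(),
                                nn.Linear(32, 1), nn.Sigmoid())   # ratio in [0, 1]

    def forward(self, soc_seq):
        x = self.conv(soc_seq.unsqueeze(1))
        x, _ = self.lstm(x.transpose(1, 2))
        return self.fc(x[:, -1])

class CriticNet(nn.Module):
    """Evaluation network: (SOC history, action) in, scalar evaluation value out."""
    def __init__(self, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(1, 16, 3, padding=1), nn.ReLU())
        self.lstm = nn.LSTM(16, hidden, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden + 1, 32), nn.ReLU(),
                                nn.Linear(32, 1))

    def forward(self, soc_seq, action):
        x = self.conv(soc_seq.unsqueeze(1))
        x, _ = self.lstm(x.transpose(1, 2))
        return self.fc(torch.cat([x[:, -1], action], dim=1))

actor, critic = ActorNet(), CriticNet()
actor_target = copy.deepcopy(actor)     # target networks share the structure
critic_target = copy.deepcopy(critic)   # of their execution networks
```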
Further, in step S5, the SOC value S before vehicle interaction is obtained through a vehicle model or preset.
Further, in step S7, each hidden layer is one of a fully connected layer, a convolutional layer, a long short-term memory layer, a sigmoid layer and a tanh layer.
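Steps S9 and S10 store and sample experience vectors (s, a, r, s') in a memory buffer; a minimal buffer sketch is shown below (the capacity and the scalar-SOC state format are assumptions).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory buffer for experience vectors (s, a, r, s')."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def full(self):
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

memory = ReplayBuffer()
memory.push(0.62, 0.35, -0.01, 0.61)   # (SOC, split ratio, reward, next SOC)
```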
Further, in step S10, the Bellman equation is:

Q^π(s_t, a_t) = E_{r_t, s_{t+1}∼E}[ r(s_t, a_t) + γ · E_{a_{t+1}∼π}[ Q^π(s_{t+1}, a_{t+1}) ] ]

where r_t denotes the reward value, s_{t+1}∼E denotes that the next state obeys the environment distribution E, E_{r_t, s_{t+1}∼E}[·] denotes the mathematical expectation over the reward r_t and a next state obeying the distribution E, r(s_t, a_t) denotes the reward value when the state is s_t and the action is a_t, and E_{a_{t+1}∼π}[·] denotes the mathematical expectation when the actions are executed according to the policy π.
Further, in step S11, the reward function neural network, the action neural network and the evaluation neural network calculate their respective weight gradients through the gradient chain rule (Learning representations by back-propagating errors, Rumelhart D. E., Hinton G. E., Williams R. J., Nature, vol. 323, 1986).
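Putting steps S10 to S12 together, the sketch below computes the Bellman target with the target networks, updates the execution networks by back propagation, and then soft-replaces the target-network parameters. It reuses the actor/critic and memory-buffer sketches above; the batch size, learning rates, discount rate γ and soft-replacement coefficient τ are assumed values.

```python
import torch
import torch.nn.functional as F

gamma, tau, batch_size = 0.99, 0.005, 64
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def soft_replace(target_net, source_net):
    """theta_target <- tau * theta_execution + (1 - tau) * theta_target."""
    for tp, sp in zip(target_net.parameters(), source_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)

def train_step():
    """One update once the memory buffer is full (steps S10 to S12)."""
    if not memory.full():
        return
    batch = memory.sample(batch_size)
    s, a, r, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    # Tile the stored scalar SOC into a length-16 history to match the toy networks above
    s, s2 = s.unsqueeze(1).repeat(1, 16), s2.unsqueeze(1).repeat(1, 16)
    a, r = a.unsqueeze(1), r.unsqueeze(1)

    with torch.no_grad():                       # Bellman target from the target networks
        y = r + gamma * critic_target(s2, actor_target(s2))

    critic_loss = F.mse_loss(critic(s, a), y)   # evaluation execution-network update
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()    # action execution-network update
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    soft_replace(actor_target, actor)           # step S12: soft replacement of the targets
    soft_replace(critic_target, critic)
```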
Compared with the prior art, the invention has the beneficial effects that:
according to the invention, the optimal reward function can be learned from expert knowledge by utilizing reverse deep reinforcement learning, and the reward function does not need to be artificially designed, so that the deep reinforcement learning effect is better. Compared with a simple optimization solving method, the method has the advantages of high calculating speed and good real-time performance, and is suitable for online application on a real vehicle.
Drawings
FIG. 1 is a schematic overall flow chart of a hybrid system energy management strategy based on reverse deep reinforcement learning according to the present invention;
fig. 2 is a flowchart of a hybrid system energy management strategy method based on reverse deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
For a better understanding of the objects, aspects and advantages of the present invention, reference is made to the following detailed description of the invention taken in conjunction with the accompanying drawings.
Example:
as shown in fig. 1 and fig. 2, a hybrid system energy management strategy based on reverse deep reinforcement learning includes the following steps:
S1: calculating a global hybrid mode distribution ratio and a global optimized SOC result by using an optimization solving method under one complete working condition, and forming expert state-action pairs as the expert knowledge of reverse reinforcement learning; the optimization solving method comprises the pseudospectral method, the dynamic programming method and the genetic algorithm.
S2: creating a reward function neural network and initializing parameters;
the reward function neural network is formed by stacking a fully-connected neural network, a convolutional neural network and a long-short term memory neural network in the order of the convolutional neural network, the long-short term memory neural network and the fully-connected neural network; the fully-connected neural network is formed by stacking a plurality of fully-connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long-short term memory neural network is formed by stacking a plurality of long-short term memory layers.
S3: learning to obtain parameters of the neural network of the reward function by utilizing reverse reinforcement learning;
the reverse reinforcement learning method specifically comprises the following steps:
S3.1, randomly generating a policy π and running it to obtain a sequence of state-action pairs ζ = {(s_t, a_t)}, where s_t denotes the t-th state in the state-action pairs, a_t denotes the t-th action, and t denotes the index of the state-action pair; the action value Q^π is then calculated by the evaluation neural network;
S3.2, using the expert state-action pairs obtained in step S1, calculating the expert action value according to the formula

Q_E^π(S_t0, A_t0) = θ^T · μ^π

where θ^T denotes the (transposed) parameter vector of the reward function network, μ^π denotes the discounted feature expectation of the policy π (the mathematical expectation E of the features accumulated with discount rate γ^t), S_t0 denotes the t0-th state of the expert state-action pairs, and A_t0 denotes the t0-th expert action, i.e. the action actually performed according to the expert knowledge at step t0;
S3.3, performing a gradient-descent update of the parameter θ with the objective function

min_θ Σ_{t=1}^{L} Σ_{i=1}^{N} [ Q^π(s_t^i, a_t^i) + L_p(s_t^i, a_t^i) - Q_E^π(s_t^i, a_t^i) ] + λ_1·‖θ‖

where s_t^i denotes the state at the t-th term of the first summation and the i-th term of the second summation, a_t^i denotes the corresponding action, L denotes the total number of terms of the first summation, N denotes the total number of terms of the second summation, λ_1 is an empirical constant used to balance the penalty term against the expectation, ε is a manually set precision threshold, and i is the counting index of the second summation; if the learned state-action pairs are consistent with the expert strategy, the loss function L_p = 0; otherwise L_p = ε.
S4: establishing an action neural network and an evaluation neural network, and initializing the parameters of each network; the action neural network comprises an execution network and a target network which are identical in structure; the evaluation neural network likewise comprises an execution network and a target network which are identical in structure;
the action neural network and the evaluation neural network are all formed by stacking a fully-connected neural network, a convolutional neural network and a long-short term memory neural network in the order of the convolutional neural network, the long-short term memory neural network and the fully-connected neural network; the fully-connected neural network is formed by stacking a plurality of fully-connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long-short term memory neural network is formed by stacking a plurality of long-short term memory layers.
S5: under one random working condition, acquiring an SOC value s before vehicle interaction;
and the SOC value s before vehicle interaction is obtained through a vehicle model or is obtained through presetting.
S6: inputting the acquired SOC value s before interaction into a reward function neural network to obtain a reward value r;
S7: inputting the acquired pre-interaction SOC value s into the action neural network, which outputs a hybrid mode distribution ratio a after processing by a plurality of hidden layers;
the hidden layer comprises a full connection layer, a convolution layer, a long short-term memory layer, a sigmoid layer and a tanh layer, namely the hidden layer is one of the full connection layer, the convolution layer, the long short-term memory layer, the sigmoid layer and the tanh layer.
S8: controlling the vehicle to interact with the environment by using the hybrid mode distribution ratio a obtained in the step S7, and acquiring an SOC value S' after interaction;
s9: combining the SOC value s before interaction, the hybrid mode distribution ratio a, the reward value r and the SOC value s 'after interaction to obtain an experience vector (s, a, r, s'), and then storing the experience vector in a memory buffer;
s10: when the number of the experience vectors in the memory buffer reaches the maximum capacity, randomly extracting a set number of experience vectors from the memory buffer as the input of an evaluation neural network, and then outputting an evaluation value by the evaluation neural network according to a Bellman equation through the processing of a plurality of hidden layers;
the Bellman equation is:
Figure BDA0002395921990000091
wherein r istIndicating a prize value, st+1E represents the state obeying distribution E,
Figure BDA0002395921990000092
denotes the reward as rt, the state obeys the mathematical expectation of the distribution E, r(s)t,at) Represents a state of stThe action is atThe value of the prize of the time of day,
Figure BDA0002395921990000093
representing the mathematical expectation when the action execution policy is pi.
S11: the reward function neural network, the action neural network and the evaluation neural network calculate their respective weight gradients, and the parameters of the reward function neural network, of the execution network of the action neural network and of the execution network of the evaluation neural network are then updated through back propagation;
the rewarding function neural network, the action neural network and the evaluation neural network calculate respective weight gradients through gradient chain formulas (Learning representation by back-amplifying errors, David E, Nature vol323, 1986).
S12: updating the parameters of the target network of the action neural network and the parameters of the target network of the evaluation neural network from the corresponding execution networks by using the soft replacement rule, thereby completing the parameter update of the action neural network and the evaluation neural network;
s13: after the parameters of the action neural network and the evaluation neural network are updated, repeating the steps S5-S12 until the set maximum step number is reached or the set convergence target is reached, finishing the training of the reward function neural network, the action neural network and the evaluation neural network, and storing the parameters of the neural network after the training is finished;
S14: controlling the controlled object with the trained reward function neural network, action neural network and evaluation neural network: the saved network parameters are first read, the SOC value s before interaction is then acquired and input into the trained action neural network, and the network outputs the hybrid power system distribution ratio a as the control quantity for controlling the vehicle.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
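Finally, the online use described in step S14 reduces to loading the saved parameters and running only the trained action network; the sketch below reuses the actor defined earlier, and the file name and the length-16 SOC history are assumptions.

```python
import torch

torch.save(actor.state_dict(), "actor_trained.pt")      # persist after step S13 (assumed file name)

actor.load_state_dict(torch.load("actor_trained.pt"))   # step S14: read the trained parameters
actor.eval()

def control_step(soc_history):
    """Online control: SOC history in, hybrid power system distribution ratio a out."""
    with torch.no_grad():
        s = torch.as_tensor(soc_history, dtype=torch.float32).unsqueeze(0)
        return float(actor(s))                           # used as the control quantity

a = control_step([0.6] * 16)                             # e.g. a constant SOC history
```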
The embodiments of the present invention are not limited to the above-mentioned embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and they are included in the scope of the present invention.

Claims (8)

1. A hybrid system energy management strategy based on reverse deep reinforcement learning is characterized by comprising the following steps:
s1: calculating a global hybrid mode distribution ratio and a global optimized SOC result by using an optimization solving method under one complete working condition, and forming an expert state-action pair as expert knowledge of reverse reinforcement learning;
s2: creating a reward function neural network and initializing parameters;
s3: the method for learning and obtaining the parameters of the reward function neural network by utilizing the reverse reinforcement learning method specifically comprises the following steps:
S3.1, randomly generating a policy π and running it to obtain a sequence of state-action pairs ζ = {(s_t, a_t)}, where s_t denotes the t-th state in the state-action pairs, a_t denotes the t-th action, and t denotes the index of the state-action pair; the action value Q^π is then calculated by the evaluation neural network;
S3.2, using the expert state-action pairs obtained in step S1, calculating the expert action value according to the formula

Q_E^π(S_t0, A_t0) = θ^T · μ^π

where θ^T denotes the (transposed) parameter vector of the reward function network, μ^π denotes the discounted feature expectation of the policy π (the mathematical expectation E of the features accumulated with discount rate γ), S_t0 denotes the t0-th state of the expert state-action pairs, and A_t0 denotes the t0-th expert action, i.e. the action actually performed according to the expert knowledge at step t0;
S3.3, performing a gradient-descent update of the parameter θ with the objective function

min_θ Σ_{t=1}^{L} Σ_{i=1}^{N} [ Q^π(s_t^i, a_t^i) + L_p(s_t^i, a_t^i) - Q_E^π(s_t^i, a_t^i) ] + λ_1·‖θ‖

where s_t^i denotes the state at the t-th term of the first summation and the i-th term of the second summation, a_t^i denotes the corresponding action, L denotes the total number of terms of the first summation, N denotes the total number of terms of the second summation, λ_1 is an empirical constant used to balance the penalty term against the expectation, and i is the counting index of the second summation; if the learned state-action pairs are consistent with the expert strategy, the loss function L_p = 0; otherwise L_p takes a positive penalty value;
S4: establishing an action neural network and an evaluation neural network, and initializing the parameters of each network; the action neural network comprises an execution network and a target network which are identical in structure; the evaluation neural network likewise comprises an execution network and a target network which are identical in structure;
s5: under one random working condition, acquiring an SOC value s before vehicle interaction;
s6: inputting the acquired SOC value s before interaction into a reward function neural network to obtain a reward value r;
S7: inputting the acquired pre-interaction SOC value s into the action neural network, which outputs a hybrid mode distribution ratio a after processing by a plurality of hidden layers;
s8: controlling the vehicle to interact with the environment by using the hybrid mode distribution ratio a obtained in the step S7, and acquiring an SOC value S' after interaction;
s9: combining the SOC value s before interaction, the hybrid mode distribution ratio a, the reward value r and the SOC value s 'after interaction to obtain an experience vector (s, a, r, s'), and then storing the experience vector in a memory buffer;
s10: when the number of the experience vectors in the memory buffer reaches the maximum capacity, randomly extracting a set number of experience vectors from the memory buffer as the input of an evaluation neural network, and then outputting an evaluation value by the evaluation neural network according to a Bellman equation through the processing of a plurality of hidden layers;
S11: the reward function neural network, the action neural network and the evaluation neural network calculate their respective weight gradients, and the parameters of the reward function neural network, of the execution network of the action neural network and of the execution network of the evaluation neural network are then updated through back propagation;
S12: updating the parameters of the target network of the action neural network and the parameters of the target network of the evaluation neural network from the corresponding execution networks by using the soft replacement rule, thereby completing the parameter update of the action neural network and the evaluation neural network;
s13: after the parameters of the action neural network and the evaluation neural network are updated, repeating the steps S5-S12 until the set maximum step number is reached or the set convergence target is reached, finishing the training of the reward function neural network, the action neural network and the evaluation neural network, and storing the parameters of the neural network after the training is finished;
S14: controlling the controlled object with the trained reward function neural network, action neural network and evaluation neural network: the saved network parameters are first read, the SOC value s before interaction is then acquired and input into the trained action neural network, and the network outputs the hybrid power system distribution ratio a as the control quantity for controlling the vehicle.
2. The hybrid power system energy management strategy based on reverse deep reinforcement learning of claim 1, wherein in step S1 the optimization solving method comprises the pseudospectral method, the dynamic programming method and the genetic algorithm.
3. The hybrid power system energy management strategy based on reverse deep reinforcement learning of claim 1, wherein: in step S2, the reward function neural network is composed of a convolutional neural network, a long short-term memory neural network and a fully connected neural network stacked in that order; the fully connected neural network is formed by stacking a plurality of fully connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long short-term memory neural network is formed by stacking a plurality of long short-term memory layers.
4. The hybrid power system energy management strategy based on reverse deep reinforcement learning of claim 1, wherein in step S4 the action neural network and the evaluation neural network are each composed of a convolutional neural network, a long short-term memory neural network and a fully connected neural network stacked in that order; the fully connected neural network is formed by stacking a plurality of fully connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long short-term memory neural network is formed by stacking a plurality of long short-term memory layers.
5. The hybrid system energy management strategy based on the reverse deep reinforcement learning of claim 1, wherein: in step S5, the SOC value S before vehicle interaction is obtained by a vehicle model or is obtained through presetting.
6. The hybrid power system energy management strategy based on reverse deep reinforcement learning of claim 1, wherein in step S7 each hidden layer is one of a fully connected layer, a convolutional layer, a long short-term memory layer, a sigmoid layer and a tanh layer.
7. The hybrid power system energy management strategy based on reverse deep reinforcement learning of claim 1, wherein in step S10 the Bellman equation is:

Q^π(s_t, a_t) = E_{r_t, s_{t+1}∼E}[ r(s_t, a_t) + γ · E_{a_{t+1}∼π}[ Q^π(s_{t+1}, a_{t+1}) ] ]

where r_t denotes the reward value, s_{t+1}∼E denotes that the next state obeys the environment distribution E, E_{r_t, s_{t+1}∼E}[·] denotes the mathematical expectation over the reward r_t and a next state obeying the distribution E, r(s_t, a_t) denotes the reward value when the state is s_t and the action is a_t, and E_{a_{t+1}∼π}[·] denotes the mathematical expectation when the actions are executed according to the policy π.
8. The hybrid power system energy management strategy based on reverse deep reinforcement learning of claim 1, wherein in step S11 the reward function neural network, the action neural network and the evaluation neural network calculate their respective weight gradients through the gradient chain rule.
CN202010131644.XA 2020-02-28 2020-02-28 Hybrid system energy management strategy based on reverse deep reinforcement learning Active CN111367172B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010131644.XA CN111367172B (en) 2020-02-28 2020-02-28 Hybrid system energy management strategy based on reverse deep reinforcement learning


Publications (2)

Publication Number Publication Date
CN111367172A CN111367172A (en) 2020-07-03
CN111367172B (en) 2021-09-21

Family

ID=71208367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010131644.XA Active CN111367172B (en) 2020-02-28 2020-02-28 Hybrid system energy management strategy based on reverse deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111367172B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112498334B (en) * 2020-12-15 2022-03-11 清华大学 Robust energy management method and system for intelligent network-connected hybrid electric vehicle
CN113110052B (en) * 2021-04-15 2022-07-26 浙大宁波理工学院 Hybrid energy management method based on neural network and reinforcement learning
CN113071508B (en) * 2021-06-07 2021-08-20 北京理工大学 Vehicle collaborative energy management method and system under DCPS architecture
CN113595426B (en) * 2021-07-06 2022-11-01 华中科技大学 Control method of multilevel converter based on reinforcement learning
CN114047745B (en) * 2021-10-13 2023-04-07 广州城建职业学院 Robot motion control method, robot, computer device, and storage medium


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105644548A (en) * 2015-12-28 2016-06-08 中国科学院深圳先进技术研究院 Energy control method and device for hybrid electric vehicle
CN108177648A (en) * 2018-01-02 2018-06-19 北京理工大学 A kind of energy management method of the plug-in hybrid vehicle based on intelligent predicting
CN108427985A (en) * 2018-01-02 2018-08-21 北京理工大学 A kind of plug-in hybrid vehicle energy management method based on deeply study
CN108909702A (en) * 2018-08-23 2018-11-30 北京理工大学 A kind of plug-in hybrid-power automobile energy management method and system
CN109143867A (en) * 2018-09-26 2019-01-04 上海海事大学 A kind of energy management method of the hybrid power ship based on ANN Control
CN109591659A (en) * 2019-01-14 2019-04-09 吉林大学 A kind of pure electric automobile energy management control method of intelligence learning
CN109552079A (en) * 2019-01-28 2019-04-02 浙江大学宁波理工学院 A kind of rule-based electric car energy composite energy management method with Q-learning enhancing study
CN110254418A (en) * 2019-06-28 2019-09-20 福州大学 A kind of hybrid vehicle enhancing study energy management control method
CN110341690A (en) * 2019-07-22 2019-10-18 北京理工大学 A kind of PHEV energy management method based on deterministic policy Gradient learning
CN110406526A (en) * 2019-08-05 2019-11-05 合肥工业大学 Parallel hybrid electric energy management method based on adaptive Dynamic Programming

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yanwei Liu et al., "Multi-Objective Optimization of Energy Management Strategy on Hybrid Energy Storage System Based on Radau Pseudospectral Method," IEEE Access, 2019-08-13, pp. 112483-112493. *
Liu Yanwei et al., "Energy management strategy optimization of electric vehicles with hybrid energy storage based on the Radau pseudospectral method," Automotive Engineering, vol. 41, no. 6, June 2019, pp. 625-633. *
Wang Tianyuan, "Research on energy management and optimal control of plug-in hybrid electric buses," China Master's Theses Full-text Database, 2019-12-15, C035-296. *
Chen Xiliang et al., "A survey of deep inverse reinforcement learning," Computer Engineering and Applications, vol. 54, no. 5, May 2018, pp. 24-35. *

Also Published As

Publication number Publication date
CN111367172A (en) 2020-07-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant